Unsupervised Cross-lingual Representation Learning

  • Home
  • blog
  • Unsupervised Cross-lingual Representation Learning
blog image

Cross-lingual representation learning can be seen as an instance of transfer learning, similar to domain adaptation. The domains in this case are different languages.

Viewing cross-lingual learning as a form of transfer learning can help us understand why it might be useful and when it might fail:

  1. Transfer learning is useful whenever the scenarios between which we transfer share some underlying structure. In a similar vein, languages share commonalities on many levels—from loanwords and cognates on the lexical level, to the structure of sentences on the syntactic level, to the meaning of words on the semantic level. Learning about the structure of one language may therefore help us do better when learning a second language.
  2. Transfer learning struggles when the source and target settings are too different. Similarly, it is harder to transfer between languages that are dissimilar (e.g. that have different typological features in the World Atlas of Language Structures).

Similar to the broader transfer learning landscape, much of the work on cross-lingual learning over the last years has focused on learning representations of words. While there have been approaches that learn sentence-level representations, only recently are we witnessing the emergence of deep multilingual models.

The Digital Language Divide

The language you speak shapes your experience of the world. This is known as the Sapir-Whorf hypothesis and is contested. On the other hand, it is factual that the language you speak shapes your experience of the world online. What language you speak determines your access to information, education, and even human connections. In other words, even though we think of the Internet as being open to anyone, there is a digital language divide between dominant languages (mostly from the western world) and others.

As NLP technologies are becoming more commonplace, this divide extends to the technological level: At best, this means that non-English language speakers do not benefit from the latest features of digital assistants; at worst, algorithms are biased against or discriminate against speakers of non-English languages—we can see occurrences of this already today. To ensure that non-English language speakers are not left behind and at the same time to offset the existing imbalance, we need to apply our models to non-English languages.

1 Comment

  1. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took.

Comments are closed.