Training Non-English NLP Models with English Training Data

Superb AI

2020/2/8 · 9 min read

Hi everyone! My name is Channy Hong, and I am a junior at Harvard College studying computer science. This past summer, I had the privilege of interning at Superb AI (a YC-backed startup) to conduct NLP research alongside my mentor, Jaeyeon Lee, and supervisor, Jung Kwon Lee. My internship experience ultimately culminated in a research paper titled “Unsupervised Interlingual Semantic Representations from Sentence Embeddings for Zero-Shot Cross-Lingual Transfer” and its submission to the Association for the Advancement of Artificial Intelligence (AAAI). Excitingly, our work was recently accepted for presentation at the main technical program, and this blog post serves as a high-level overview of the paper and our work in general.

MOTIVATION. Let’s dive right in. Because most NLP research today is conducted in English, non-English languages generally tend to lack labeled datasets that can be used for training models. However, the need for practical applications that incorporate NLP models exists in every language, and this mismatch presents a significant roadblock to reproducing state-of-the-art task-solving models in non-English languages.

NATURAL LANGUAGE INFERENCE. For example, let us consider an application that analyzes a ‘hypothesis’ sentence, when given a ‘premise’ sentence. Perhaps this analysis consists of classifying the hypothesis as either one of ‘entailment’, ‘neutral’, or ‘contradiction’, based on its semantic relationship with the premise. This is the specification of the natural language inference (NLI) task, which also happens to be a task commonly used for benchmarking the effectiveness of a language encoder (e.g. BERT, XLNet).

Perhaps it is easiest to explain the differences between the three classifications (entailment / neutral / contradiction) by beginning with contradiction:

Contradiction: if “a man” is “inspect[ing] the uniform of a figure in some East Asian country” (premise), then necessarily, “the man” cannot currently be “sleeping” (hypothesis). Thus, this relationship is a contradiction.

Entailment: if there is “a soccer game with multiple males playing” (premise), then necessarily, “some men are playing a sport” (hypothesis). Thus, this relationship is an entailment.

Neutral: if there is “an older and younger man smiling” (premise), then we cannot be sure whether “two men are smiling and laughing at the cats playing on the floor” (hypothesis), necessarily. Thus, this relationship is neutral.

It is easy to discern that NLI is quite a challenging task that requires a thorough understanding of the actual semantics of the sentences in order to solve (unlike the sentiment analysis task for instance, which may be able to yield satisfactory results whilst solely relying on being able to pick up on certain keywords).

CLASSIFIER. Now, let’s think about developing a model that solves this NLI task. If there is sufficient training data (i.e. a sufficient number of ‘training examples’, each consisting of a premise sentence, a hypothesis sentence and the correct ‘label’), we can use a well-known encoder such as Google’s BERT to produce semantically rich vector representations (i.e. embeddings) of the premise and hypothesis sentences. We then feed the two embeddings (concatenated) to a neural network and repeatedly update (i.e. train) it so that it learns to ‘discern’ the relationship between the two sentences and (hopefully) outputs the correct answer (one of entailment / neutral / contradiction).

Again, it’s easy enough to train an effective neural network for the English NLI task, because we have enough training data for it. In particular, we can utilize the SNLI and MNLI datasets, with 570k and 390k training examples respectively. In our study, we utilized the MNLI dataset, converting its sentence strings into 768-dimensional embeddings using the bert-as-service library (using its default settings), and feeding the concatenated premise+hypothesis embeddings to our go-to classifier neural network (of a single hidden layer of size 768, then outputting a 3-way confidence score).

Classifier training paradigm: when the concatenated premise and hypothesis sentence embeddings are passed into the neural network (our go-to classifier), the 3-way confidence scores are outputted and the prediction (the argmax) and the ‘label’ (the correct answer) are compared to make updates to the weights and biases of our go-to classifier neural network.
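To make the paradigm above a bit more concrete, here is a minimal NumPy sketch of the forward pass and loss of such a classifier. The layer sizes (768-dimensional embeddings, one hidden layer of size 768, 3 outputs) come from this post; the ReLU activation, the weight initialization, the softmax cross-entropy loss, and the random stand-in embeddings are assumptions made for illustration and may differ from what we actually used.

```python
import numpy as np

EMB_DIM = 768      # size of one sentence embedding (from the post)
HIDDEN_DIM = 768   # single hidden layer of size 768 (from the post)
NUM_CLASSES = 3    # entailment / neutral / contradiction

def init_params(rng):
    # Small random weights; this initialization scheme is an assumption.
    return {
        "W1": rng.normal(0.0, 0.02, size=(2 * EMB_DIM, HIDDEN_DIM)),
        "b1": np.zeros(HIDDEN_DIM),
        "W2": rng.normal(0.0, 0.02, size=(HIDDEN_DIM, NUM_CLASSES)),
        "b2": np.zeros(NUM_CLASSES),
    }

def forward(params, premise, hypothesis):
    # Concatenate the two embeddings, apply one hidden layer, score 3 classes.
    x = np.concatenate([premise, hypothesis], axis=-1)
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])  # ReLU (assumed)
    logits = h @ params["W2"] + params["b2"]
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)          # 3-way confidence scores

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the correct labels (assumed loss).
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())

rng = np.random.default_rng(0)
params = init_params(rng)
premise = rng.normal(size=(4, EMB_DIM))     # random stand-ins for BERT embeddings
hypothesis = rng.normal(size=(4, EMB_DIM))
labels = np.array([0, 1, 2, 1])             # made-up labels for the toy batch

probs = forward(params, premise, hypothesis)
predictions = probs.argmax(axis=-1)         # the prediction is the argmax
loss = cross_entropy(probs, labels)         # compared against the labels
```

In an actual training loop, the gradient of this loss with respect to the weights and biases would drive the updates; that step is left out here to keep the sketch short.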

Our go-to classifier achieves an accuracy rate of 63.8% on the MNLI task. The accuracy rate was calculated by counting the number of correct predictions on the 5000 test examples of the MNLI dataset. This 63.8% (around 3200 examples correctly predicted) is the baseline of the baseline, since we simply trained our go-to classifier neural network with a sufficient amount of training examples.
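As a quick sanity check on the arithmetic, the snippet below shows how an accuracy rate is computed and how many correct predictions a 63.8% accuracy implies on 5000 examples. The function name and the toy predictions are made up for illustration; only the two constants come from this post.

```python
NUM_TEST_EXAMPLES = 5000   # size of the test set (from the post)
REPORTED_ACCURACY = 0.638  # accuracy reported in the post

def accuracy(predictions, labels):
    # Fraction of predictions that match the correct labels.
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)

# Number of correct predictions implied by the reported accuracy.
implied_correct = round(REPORTED_ACCURACY * NUM_TEST_EXAMPLES)

# Toy check with made-up predictions: 2 of the 3 match.
toy_accuracy = accuracy([0, 1, 2], [0, 1, 1])
```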

XNLI. Okay, what if we are now given non-English NLI tasks? The Cross-Lingual NLI (XNLI) Corpus team at NYU has equipped us with a robust suite in 14 non-English languages, each consisting of 2,500 development examples and 5,000 test examples (manual translations of the MNLI English versions). To solve them, it would be easiest if we could just download training examples (preferably on the order of hundreds of thousands of examples) for the language-of-interest and then train our go-to classifier, just as we did for English. But we cannot, since such training examples do not exist. Hmm…

TRANSLATE-TRAIN. Then, how about we just Google Translate the English MNLI training examples into the language-of-interest, then train our go-to classifier model using that? That could certainly work, and the XNLI team have conveniently released such ‘neural machine-translated’ versions (445MB) of the 390k English training examples for use as well (again, for the 14 non-English languages).

For our work, we chose as our languages-of-interest Spanish, German, Chinese, and Arabic — the former two being linguistically close to English and the latter two linguistically distant. Again, using the bert-as-service library (BERT-multilingual can produce vector form representation of sentence strings from 104 languages, including the above four plus English) we can achieve the following accuracy rates within this translate-train paradigm:

Translate-Train results on the XNLI test suite.

The neural machine translation (NMT) system that the XNLI team used itself required massive amounts of parallel (with English) data for its own training, that is, semantically equivalent sentences paired together (e.g. “I am hungry” in English paired with “j’ai faim” in French). However, it is easy to see that for some of the less dominant languages such as Urdu, Polish, or Tamil, parallel (with English) datasets may be lacking in both quantity and quality, even though those languages may very well be well-endowed with monolingual corpora (such as Wikipedia dumps).

Some possible routes for obtaining parallel data include widely translated texts (such as the Bible or the Quran) or projects like OpenSubtitles, but the parallel datasets obtained through these routes do not tend to include the full expressiveness of each language (i.e. limited mostly to dialogues, exaggerated use of archaic expressions, etc.), which inadvertently has adverse effects on the quality of the translation. Thus, the above table is yet another baseline we are going to use as sort of the optimistic model performance: “With abundant & quality parallel data — since parallel (with English) data can be easy enough to find for these 4 major languages — we can achieve this much.”

INTERLINGUAL SEMANTIC REPRESENTATIONS. Okay, but instead of translating the training examples from English to the language of interest, what if we first ‘translate’ them into some form of semantic representation that is independent of language-specific syntax and specifications, and only then ‘translate’ into the target language? Once we become confident enough that these intermediate representations are semantically transitive (i.e. they do not lose the semantic information of the original sentence), we can train our go-to classifier on top of them, instead of having to translate all the way to the target language.

This line of thinking is feasible, because if we really think about what our go-to classifier neural network is doing, it is picking up on certain semantic features of the English sentence embeddings. Thus, we can imagine that if we can similarly distill sentences into interlingual, semantically-rich vector form, an effective classifier will still be able to pick up on the semantic information encoded within it. The benefit of this abstracted approach is that we now no longer have to assume that English is our de facto base language. It can be useful to have a general framework in which training data from multiple languages can be simultaneously used. Perhaps, English in the future will no longer be the solely dominant source of training data.

Of course, the idea is nice and dandy, but the real challenge is coming up with an encoder that can ‘encode’ sentence embeddings (such as those produced by BERT) into interlingual semantic representations (ISR) on top of which our go-to classifier can be trained. Below is what our desired classifier training paradigm would look like.

Classifier training paradigm with the ISR encoder fitted in.

ISR ENCODER. Now, onto coming up with (i.e. training) the ISR encoder. In particular, for our ISR encoder to produce semantic representations of sentences, it should, given the inputs “I am hungry [English]” and “J’ai faim [French]”, output proximal (similar) ISR embeddings, since the two sentences carry proximal semantic information.

In accordance with the idea of translate-train, one way of achieving this feat would be to utilize a large quantity of parallel data to have our encoder learn to map semantically equivalent sentences in different languages to proximal embedding spaces. But once again, we are not interested in utilizing parallel data given its infeasibility of use for many low-resource languages. * Then, how can we train such an encoder in an unsupervised fashion — solely using monolingual corpora? *

INSPIRATION. Surprisingly, our main inspiration comes from the realm of computer vision. ‘Translation’ in computer vision usually amounts to devising a method of translating an input from its original domain to target domain, and the popular unsupervised method is to have the encoder see a large quantity of instances of every domain such that it develops a certain understanding of the distribution of the instances of each domain. The forerunning use case of this idea is CycleGAN, in which training solely with unpaired data suffices for an effective translation system.

From Zhu et al. “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”.

The core concept behind CycleGAN is to quasi-simulate parallel data by generating (i.e. translating) a supposed translation of the input image into the target domain. This generation process is induced to favor translations that fit well within the overall distribution of the target domain (thus rendering them as truly belonging to the target domain). Then, this supposed translation is translated back into the original domain, and this ‘reconstructed’ image is compared (and induced to be similar) with the original input image. This is the ‘cycle consistency loss’: checking whether or not the core contents (or semantics, in the case of NLP) are preserved during this reconstruction process. We will refer to this ‘cycle consistency loss’ as the reconstruction loss from here on.

Once again, however, we are not interested in translating our sentences from one domain to another all the way, but rather in capturing certain qualities for our intermediate ISR. In particular, we want to make sure that both semantic transitivity and language-agnosticity of our ISR are preserved, such that ISR generated from an English sentence should be proximal to the ISR generated from a semantically equivalent French sentence, and so on.

ISR CONSISTENCY LOSS. To address this concern, we again utilized the fact that we are quasi-simulating parallel datasets (by way of translated sentences) and introduced the ISR consistency loss, in which the ISR of the forward translation and the ISR of the backward translation are compared (and induced to be similar). The idea here is that with semantic transitivity preserved via an additional layer of consistency loss, the ISR generated from any language (be it the original input domain or the once-translated target domain) must be proximal if the sentences are semantically equivalent.

Note that our reconstruction loss is analogous to CycleGAN’s cycle consistency loss. Also note that our encoder and decoder both take in a domain label (original and target, respectively). Thus, in the big picture, we can assume that there exists a disparate encoder and decoder for each language (where one gets activated over the others according to the trailing one-hot section of the input vector) with some shared weights and biases (since there are bound to be significant similarities between the encoders and decoders of every language). This notion of disparate encoder and decoder is analogous to the disparate F and G from the CycleGAN paper.
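The two losses described above can be sketched on a single toy embedding as follows. This is only an illustration under stated assumptions: the single-layer encoder and decoder, the toy dimensions, the L1 distance, and the language indices are all made up here and are not taken from the paper, and the adversarial and domain classification losses are left out entirely.

```python
import numpy as np

# Toy sizes for readability; the real sentence embeddings are 768-dimensional.
EMB_DIM, ISR_DIM, NUM_LANGS = 8, 6, 5

def with_domain(vec, lang):
    # Append a trailing one-hot language label to the input vector.
    one_hot = np.zeros(NUM_LANGS)
    one_hot[lang] = 1.0
    return np.concatenate([vec, one_hot])

rng = np.random.default_rng(0)
# Shared weights for all languages; the one-hot label selects the behavior.
W_enc = rng.normal(0.0, 0.1, size=(EMB_DIM + NUM_LANGS, ISR_DIM))
W_dec = rng.normal(0.0, 0.1, size=(ISR_DIM + NUM_LANGS, EMB_DIM))

def encode(emb, lang):
    # Sentence embedding + original-language label -> ISR (assumed architecture).
    return np.tanh(with_domain(emb, lang) @ W_enc)

def decode(isr, lang):
    # ISR + target-language label -> sentence embedding in that language.
    return with_domain(isr, lang) @ W_dec

def cycle_losses(emb, src, tgt):
    isr_fwd = encode(emb, src)            # ISR of the original sentence
    translated = decode(isr_fwd, tgt)     # supposed translation into tgt
    isr_bwd = encode(translated, tgt)     # ISR of the supposed translation
    reconstructed = decode(isr_bwd, src)  # translated back into src
    rec_loss = float(np.mean(np.abs(emb - reconstructed)))  # reconstruction loss
    isr_loss = float(np.mean(np.abs(isr_fwd - isr_bwd)))    # ISR consistency loss
    return rec_loss, isr_loss

x_en = rng.normal(size=EMB_DIM)           # random stand-in for an English embedding
rec_loss, isr_loss = cycle_losses(x_en, src=0, tgt=1)
```

Minimizing `rec_loss` pushes the round trip to preserve the original embedding, while minimizing `isr_loss` pushes the two intermediate representations of the same sentence toward each other regardless of language.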

There are additional details such as the adversarial loss and the domain classification loss that are integral pieces to the workings of this framework, but we leave it to the readers to read up more about them in our paper.

TRAINING DETAILS. In our work, we scraped 400k sentences each from Wikipedia dumps of the 5 languages (English, Spanish, German, Chinese, Arabic) used, then trained our entire framework (which includes the ISR encoder) for around 60 hours on a single Tesla T4 GPU until we were sufficiently confident in the semantic transitivity of the translated sentences.

With our ISR encoder trained and fixed, we then ran training of our go-to classifier model using the English training data. To recap:

  • the sentences of the English training examples were converted into semantically rich representations by BERT-multilingual,

  • then to ISR by our ISR encoder,

  • and our classifier was trained on top of that.

RESULTS. And finally, we performed zero-shot cross-lingual transfer (i.e. training with English training examples and testing with non-English test examples), where the test examples are converted into semantically rich representations by BERT-multilingual, then to ISR by our ISR encoder, and their concatenation is fed into the classifier to output a prediction. We report the results of the zero-shot cross-lingual transfer below:
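The evaluation flow just described can be sketched as a small pipeline. The three callables below (`toy_embed`, `toy_isr_encode`, `toy_classify`) are hypothetical stand-ins for BERT-multilingual, the trained ISR encoder, and the trained classifier, and the two test examples are made up; none of this is the code from our repository.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Example = Tuple[str, str, int]  # (premise, hypothesis, label index)

def predict(premise: str, hypothesis: str, lang: str,
            embed: Callable[[str], Vector],
            isr_encode: Callable[[Vector, str], Vector],
            classify: Callable[[Vector], int]) -> int:
    # Sentence string -> multilingual embedding -> ISR, for both sentences.
    premise_isr = isr_encode(embed(premise), lang)
    hypothesis_isr = isr_encode(embed(hypothesis), lang)
    # The classifier sees the concatenation of the two ISR vectors.
    return classify(premise_isr + hypothesis_isr)

def zero_shot_accuracy(examples: List[Example], lang: str,
                       embed, isr_encode, classify) -> float:
    # The classifier was trained on English only; here it is tested on `lang`.
    correct = sum(
        predict(p, h, lang, embed, isr_encode, classify) == label
        for p, h, label in examples
    )
    return correct / len(examples)

# Hypothetical toy components so that the sketch runs end to end.
def toy_embed(sentence: str) -> Vector:
    return [float(len(sentence)), float(sentence.count(" "))]

def toy_isr_encode(vec: Vector, lang: str) -> Vector:
    return [v / 10.0 for v in vec]

def toy_classify(features: Vector) -> int:
    return int(sum(features)) % 3

# Made-up Spanish examples, not taken from XNLI.
spanish_test = [
    ("Un hombre duerme.", "Un hombre corre.", 2),
    ("Dos hombres cantan.", "Hay personas.", 0),
]
acc = zero_shot_accuracy(spanish_test, "es", toy_embed, toy_isr_encode, toy_classify)
```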

The BERT English model above is simply a classifier trained directly on top of embeddings created by BERT-multilingual on English training examples. Consistent with the findings by Pires et al., BERT-multilingual demonstrates already strong zero-shot transfer performance, while we seek to build our framework on top of it and demonstrate stronger performance than this already solid baseline.

* Note that our framework yields trailing yet comparable results to the Translate-Train baseline. While the absolute performance of the ISR encoder seems yet to lack the empirical robustness for widespread use in industry applications, we nonetheless believe that this result displays a strong case for the overall feasibility of this unsupervised method of zero-shot cross-lingual transfer. *

ABLATION STUDIES. We were able to confirm the effectiveness of the design choices of our framework by training our ISR encoder without one of the domain classification loss and the ISR consistency loss, then plotting the t-SNE visualizations of 1000 sets of parallel sentences in the 5 languages (from the XNLI development suite).

t-SNE visualization of 1000 parallel sentences from the English, Spanish, German, Chinese, and Arabic XNLI development sets. Colors correspond to languages: English as teal, Spanish as blue, German as yellowish-green, Chinese as purple, and Arabic as red.

As shown above, the inclusion of domain classification loss and ISR consistency loss contributed significantly to the overall distribution matching of the ISR from each language.

We further elucidated the effects of our design choices on semantic transitivity by drawing edges between a subset of semantically parallel sentences. The below figure shows that ISR consistency loss contributed significantly to the ‘permutational alignment’ of the parallel sentences, inducing semantically proximal (i.e. parallel) sentences to be plotted in proximal spaces.

t-SNE visualization of the sentences from the previous figure, with a subset of sentences highlighted and edges drawn between semantically parallel sentences.

We were also able to confirm the above quantitatively by evaluating both encoders’ zero-shot cross-lingual transfer capabilities, as shown in the table below.

Zero-shot cross-lingual transfer results of the ISR encoder and the encoder trained w/o ISR consistency loss on XNLI test suite.

CONCLUSION. We believe that the logical next step in this direction of study is extending the application of this framework to language encoders that produce word-level, rather than sentence-level, embeddings. Although just a small step, we hope that our work opens a door in a novel, scalable direction in which the problem of lacking data in low-resource languages can be addressed; and in that spirit, the code implementations we have used in this work and the instructions can be found here.

Thank you very much for reading!