Concept

Language model pre-training: leveraging monolingual data for low-resource NMT

Pre-training language models can be very helpful for NMT, especially in the low-resource setting. Language model pre-training initializes an NMT model with language understanding and generation capability learned from monolingual data alone. The encoder and decoder can be pre-trained either separately or jointly.

In separate pre-training, the encoder and the decoder are each initialized with a language model and then fine-tuned on supervised parallel data. A common objective for this is masked language modeling, in which some tokens in the input text are masked and the model learns to predict them. Translation language modeling extends masked language modeling by concatenating a parallel sentence pair into a single input, so masked tokens can be predicted from context in either language. A drawback of pre-training the encoder and decoder separately is that the encoder-decoder attention, which connects the source and target representations and is crucial for translation, is not pre-trained at all.

Jointly pre-training the encoder, decoder, and attention therefore tends to give better translation accuracy. One way to do this is masked sequence-to-sequence learning, which randomly masks a fragment (several consecutive tokens) of the encoder's input sentence and trains the decoder to predict the masked fragment. Another is denoising: noise is added to the encoder input, for example by randomly masking tokens, and the decoder learns to reconstruct the original text. T5, for example, masks random spans, replaces each span of consecutive tokens with a single sentinel token, and trains the decoder to predict the dropped spans.
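As a concrete illustration of the separate-pre-training objectives, here is a minimal, framework-free Python sketch of how masked language modeling (MLM) and translation language modeling (TLM) training examples can be built from tokenized text. The [MASK] and [SEP] symbols, the 15% masking rate, and the function names are illustrative assumptions, not details taken from any particular implementation.

```python
import random

MASK = "[MASK]"  # assumed placeholder token
SEP = "[SEP]"    # assumed separator between the two sides of a parallel pair

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """BERT-style masked language modeling: hide random tokens and
    ask the model to predict the originals at the masked positions."""
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)    # loss only at masked positions
        else:
            inputs.append(tok)
            targets.append(None)   # no prediction needed here
    return inputs, targets

def tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, rng=random):
    """Translation language modeling: concatenate a parallel sentence
    pair so masked tokens can be predicted from either language."""
    return mask_tokens(src_tokens + [SEP] + tgt_tokens, mask_prob, rng)

if __name__ == "__main__":
    random.seed(0)
    src = "the cat sat on the mat".split()
    tgt = "le chat est assis sur le tapis".split()
    print(mask_tokens(src))
    print(tlm_example(src, tgt))
```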
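For the joint objectives, the sketch below builds encoder inputs and decoder targets for masked sequence-to-sequence learning (a single consecutive fragment is hidden, as in MASS) and for T5-style span corruption (each span of consecutive masked tokens collapses to one sentinel). The [MASK] placeholder, the 50% fragment length, the <extra_id_i> sentinel naming, and the hand-picked spans in the demo are assumptions for illustration, not a definitive implementation.

```python
import random

def mass_style_mask(tokens, frac=0.5, rng=random):
    """Masked sequence-to-sequence learning: mask one consecutive
    fragment on the encoder side; the decoder predicts that fragment."""
    n = len(tokens)
    span_len = max(1, int(n * frac))
    start = rng.randrange(0, n - span_len + 1)
    encoder_input = tokens[:start] + ["[MASK]"] * span_len + tokens[start + span_len:]
    decoder_target = tokens[start:start + span_len]
    return encoder_input, decoder_target

def t5_style_corrupt(tokens, spans):
    """T5-style span corruption: each (start, length) span of consecutive
    tokens is replaced by a single sentinel; the target lists the dropped
    spans, each preceded by its sentinel."""
    encoder_input, decoder_target = [], []
    pos = 0
    for i, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        encoder_input += tokens[pos:start] + [sentinel]
        decoder_target += [sentinel] + tokens[start:start + length]
        pos = start + length
    encoder_input += tokens[pos:]
    return encoder_input, decoder_target

if __name__ == "__main__":
    random.seed(0)
    toks = "pre training helps low resource translation models".split()
    print(mass_style_mask(toks))
    print(t5_style_corrupt(toks, spans=[(1, 2), (5, 1)]))
```

In both cases the point is the same: the encoder, decoder, and the attention between them are all exercised by the pre-training objective, using monolingual text only.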

Updated 2022-05-29

Tags

Deep Learning (in Machine learning)

Data Science