Learn Before
Concept
mBART Model Configuration
The authors fine-tuned mBART, which is pretrained on monolingual data in many languages, using source-side word masking and replacement (Baziotis et al., 2021) and target-side word-replacement noise (Voita et al., 2021). For noisy fine-tuning, they trained three variants: “mBART + mask” (masking 10% of source tokens), “mBART + replace (enc)” (replacing 10% of source tokens with random tokens), and “mBART + replace (dec)” (replacing 10% of target tokens with random tokens).
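The three noising schemes can be sketched as simple operations on token-id sequences. This is an illustrative sketch only: `MASK_ID` and `VOCAB_SIZE` are placeholder values, not mBART's actual tokenizer constants, and the real fine-tuning applies this noise inside the training data pipeline.

```python
import random

MASK_ID = 250001     # placeholder mask-token id (the real id depends on the tokenizer)
VOCAB_SIZE = 250002  # placeholder vocabulary size

def mask_tokens(token_ids, p=0.10, mask_id=MASK_ID):
    """'mBART + mask': replace a fraction p of source tokens with the mask token."""
    return [mask_id if random.random() < p else t for t in token_ids]

def replace_tokens(token_ids, p=0.10, vocab_size=VOCAB_SIZE):
    """'mBART + replace': swap a fraction p of tokens for random vocabulary ids.
    Applied to the source sequence ('enc') or the target sequence ('dec')."""
    return [random.randrange(vocab_size) if random.random() < p else t
            for t in token_ids]
```

Each position is noised independently with probability p = 0.10, so on average 10% of tokens are masked or replaced while the sequence length is unchanged.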
Updated 2023-02-17
Tags
Data Science