Randomly initialized models share the same architecture as mBART: a Transformer with 12 encoder and decoder layers, 1024-dimensional embeddings, and 16 self-attention heads.

University of California, Berkeley

Google

Authors use both randomly initialized and mBART initialized models in experiments. 


Monolingual Data Models


The authors fine-tuned and trained mBART on monolingual data of many languages, using source-side word masking and replacement (BNaziotis et al., 2021) and trarget-side word-replacement noise (Voita et al., 2021). For noisy finetuning, the authors trained “mBART + mask” (masking 10% of source tokens), “mBART + replace (enc)” (replacing 10% source tokens with random ones), and “mBART + replace (dec)” (replacing 10% of target tokens with random ones). 


mBART Model Configuration 


Models are optimized using Adam (Kingma and Ba, 2015). A dropout of 0.3, attention-dropout of 0.1, and label smoothing of 0.1 are applied. Each model is evaluated every 5K updates on the dev set and the one with the best BLEU is selected. 


Learn Before

Related