Literal Translation Errors

All models produce fewer literal translation errors when trained on the joint split (where test idioms also occur in the training data) than on the zero split (where they are held out). Pretraining and upsampling the idiom-train data help all models. Masking increases errors on the joint split, and decoder-side word replacements behave similarly in terms of LitTER (Literal Translation Error Rate); adding word replacements on the encoder side reduces LitTER.
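
The card does not define how LitTER is computed. As a rough, hypothetical sketch (not the paper's exact procedure): a literal-translation error rate can be estimated by flagging any system output that contains a word from a per-idiom blocklist of literal word-level translations. The function name `litter`, the toy German outputs, and the blocklists below are all illustrative assumptions.

```python
def litter(hypotheses, blocklists):
    """Fraction of hypotheses containing at least one blocked word,
    i.e. a literal word-for-word rendering of the source idiom."""
    assert len(hypotheses) == len(blocklists)
    errors = 0
    for hyp, blocked in zip(hypotheses, blocklists):
        tokens = set(hyp.lower().split())
        # An output counts as an error if any blocked literal
        # translation of an idiom word appears in it.
        if tokens & {w.lower() for w in blocked}:
            errors += 1
    return errors / len(hypotheses)


# Hypothetical EN->DE example: "kick the bucket" rendered literally
# ("trat den Eimer") vs. idiomatically ("starb ploetzlich").
hyps = ["er trat den eimer", "er starb ploetzlich"]
blocks = [{"Eimer", "trat"}, {"Eimer", "trat"}]
print(litter(hyps, blocks))  # 0.5: one of the two outputs is literal
```

Under this reading, a lower LitTER means fewer outputs fall back on word-by-word renderings of idioms, which is how the encoder-side word replacements above show their benefit.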

Updated 2023-02-17

Tags

Data Science