Concept

Exploiting comparable corpora to leverage monolingual data for Low-Resource NMT

Monolingual data in different languages that refer to the same entity (e.g. Wikipedia pages in different languages describing the same object) can be treated as comparable corpora, which are much easier to obtain than parallel data. Comparable corpora are a strong supplement to parallel corpora: they contain implicit parallel information, and parallel sentences can be extracted from them using language models or translation models. The core problem is mining those parallel sentences. Common approaches include (i) cross-lingual sentence embeddings, which score candidate sentence pairs in a shared semantic space; (ii) retrieve-and-edit methods, which extract potential target sentences for a given source sentence and then revise them with an editing mechanism so they align better with the source; and (iii) self-supervised learning, where finding semantically aligned sentences is treated as an auxiliary task. Beyond mining individual sentences, one can also exploit the aligned topic distributions of weakly paired documents, which suits documents that describe the same event or entity but are not aligned at the sentence level.
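The embedding-based mining idea above can be sketched as follows. This is a minimal illustration, not a production implementation: the `embed` function here is a stand-in built from a toy bilingual lexicon, whereas real systems would use a trained multilingual encoder (e.g. LASER or LaBSE) and margin-based rather than plain cosine scoring. The lexicon, threshold, and all names are illustrative assumptions.

```python
import numpy as np

# Toy "cross-lingual encoder": a tiny hypothetical bilingual lexicon maps
# words of either language to shared concept ids; a sentence vector is the
# normalized bag of those concepts. A real system would replace embed()
# with a trained multilingual sentence encoder.
LEXICON = {
    "cat": 0, "gato": 0,
    "eats": 1, "come": 1,
    "fish": 2, "pescado": 2,
    "dog": 3, "perro": 3,
    "runs": 4, "corre": 4,
}
DIM = 5

def embed(sentence):
    """Map a sentence to a unit vector in the shared concept space."""
    v = np.zeros(DIM)
    for w in sentence.lower().split():
        if w in LEXICON:
            v[LEXICON[w]] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def mine_pairs(src_sents, tgt_sents, threshold=0.5):
    """Greedy mining: pair each source sentence with its most similar
    target sentence if the cosine score clears the threshold."""
    tgt_vecs = np.stack([embed(t) for t in tgt_sents])
    pairs = []
    for s in src_sents:
        scores = tgt_vecs @ embed(s)  # cosine similarity (unit vectors)
        j = int(np.argmax(scores))
        if scores[j] >= threshold:
            pairs.append((s, tgt_sents[j], float(scores[j])))
    return pairs

# Example: mine pseudo-parallel pairs from two small monolingual sets.
src = ["the cat eats fish", "the dog runs"]
tgt = ["el perro corre", "el gato come pescado"]
mined = mine_pairs(src, tgt)
```

The threshold filters out source sentences whose best target candidate is still a poor match, which is the main knob for trading recall against the noise level of the mined pseudo-parallel data.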

Updated 2022-05-29

Tags

Deep Learning (in Machine learning)

Data Science