The character F-score (chrF) metric has several limitations. It is very local: because it only counts overlaps of short character n-grams, moving a large phrase to a different position might barely change the chrF score. Furthermore, chrF cannot evaluate cross-sentence properties of a document, such as its discourse coherence. chrF and similar automatic overlap metrics also perform poorly when comparing fundamentally different systems, such as human-aided translation versus machine translation, or distinct machine translation architectures. Therefore, these metrics are most appropriate for evaluating changes within a single system.

Limitations of the chrF Evaluation Method

The simplest and most robust metric for MT evaluation is called chrF, which stands for character F-score. Consider a test set from a parallel corpus, in which each source sentence has both a gold human target translation and a candidate MT translation we’d like to evaluate. The chrF metric ranks each MT target sentence by a function of the number of character n-gram overlaps with the human translation. (see pic)
Character or word overlap-based metrics like chrF are mainly used to compare two systems, with the goal of answering questions like: did the new algorithm we just invented improve our MT system?
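
As a rough illustration of the overlap counting described above, here is a minimal sentence-level sketch in Python. It is not a reference implementation: ignoring spaces, averaging over n-gram orders 1 to 6, and weighting recall with beta = 2 are assumptions made for this example, and the function names `char_ngrams` and `chrf` are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the sentences is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # matches, clipped to the smaller count
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Hypothetical example sentences, chosen only for illustration.
reference = "the cat sat on the mat because it was tired"
close_hyp = "the cat sat on a mat because it was tired"
moved_hyp = "because it was tired the cat sat on the mat"

print(chrf(close_hyp, reference))
print(chrf(moved_hyp, reference))
```

In this sketch, `moved_hyp` swaps the two halves of the reference yet still scores close to 1, because only the few character n-grams that span the junction between the halves fail to match. That is the locality limitation in miniature.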


University of Michigan - Ann Arbor

Translations are evaluated along two dimensions:
1. adequacy: how well the translation captures the exact meaning of the source sentence.
2. fluency: how fluent the translation is in the target language.

The most accurate evaluations use human raters, who can score each translation for adequacy and fluency. An alternative is ranking: give the raters a pair of candidate translations and ask them which one they prefer.
While humans produce the best evaluations of machine translation output, running a human evaluation can be time-consuming and expensive. For this reason, automatic metrics are often used.


MT Evaluation

A helpful book on NLP that is still being written:
https://web.stanford.edu/~jurafsky/slp3/
