Concept

Limitations of the chrF Evaluation Method

The character F-score (chrF) metric has several limitations. First, it is very local: because it scores overlapping character n-grams, moving a large phrase to a different position changes only the few n-grams that span the phrase boundaries, so the score may barely change. Second, chrF is typically computed on individual sentences, so it cannot evaluate cross-sentence properties of a document, such as discourse coherence. Third, chrF and similar automatic overlap metrics perform poorly when comparing fundamentally different systems, such as human-aided translation versus machine translation, or distinct machine translation architectures. These metrics are therefore most appropriate for evaluating changes within a single system.
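The locality problem can be illustrated with a minimal sketch. The function `simple_chrf` below is a hypothetical, simplified stand-in for chrF, not a reference implementation: it assumes whitespace is ignored, averages character n-gram precision and recall over n = 1 to 6, and combines them with an F-beta score using beta = 2. The example sentences are invented for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of length n, ignoring whitespace (an assumption)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF sketch: average n-gram precision and recall over
    n = 1..max_n, then combine them into an F-beta score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short to contain any n-grams of this length
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)


reference = "after a long and difficult negotiation, the two companies signed the agreement"
# Same words, but the first clause has been moved to the end.
moved = "the two companies signed the agreement after a long and difficult negotiation,"
# A sentence with unrelated content, for contrast.
unrelated = "the weather was cold and rainy throughout the whole week in the city"

score_moved = simple_chrf(moved, reference)
score_unrelated = simple_chrf(unrelated, reference)

print(f"moved clause: {score_moved:.3f}")
print(f"unrelated:    {score_unrelated:.3f}")
```

Under these assumptions the reordered sentence still scores above 0.9, because only the handful of n-grams that cross the boundary between the two clauses differ from the reference. The unrelated sentence scores lower, which shows that the metric rewards shared character sequences rather than word order.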


Updated 2026-05-01

Tags

Data Science