1Cademy - MOOC-CS Generation Diagnostic: EM/ROUGE-L/BLEU = 0.0, Token-F1 1.7 → 4.0 Due to Terse Bilingual Labels

Learn Before

Generation as Context-Quality Diagnostic, Not a Headline Claim

Example

MOOC-CS Generation Diagnostic: EM/ROUGE-L/BLEU = 0.0, Token-F1 1.7 → 4.0 Due to Terse Bilingual Labels

On MOOC-CS, the end-to-end generation diagnostics show exact match, ROUGE-L, and BLEU all staying at 0.0, while token-level F$_1$ rises only from 1.7 to 4.0 when moving from heuristic concatenation to the hierarchical baseline. The paper attributes these low scores to a reference-style artifact: the gold answers are terse bilingual concept labels, so surface-form generation metrics have very little room to award credit even when the retrieved context is informative. The numbers are therefore not used to argue against the retriever; they only flag that the MOOC-CS references make generation metrics uninformative.
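The reference-style artifact can be illustrated with a minimal sketch. It assumes SQuAD-style scoring (lowercased, whitespace-tokenized exact match and token F1), which may differ from the paper's actual normalization. The reference and prediction strings are invented for illustration and are not taken from MOOC-CS, and the Chinese/English pairing is itself an assumption about what "bilingual" means here.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 only if the normalized strings are identical."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical terse bilingual label used as the gold answer.
reference = "二叉树 binary tree"
# Hypothetical fluent, factually reasonable generated answer.
prediction = (
    "A binary tree is a data structure in which each node "
    "has at most two children."
)

em = exact_match(prediction, reference)
f1 = token_f1(prediction, reference)
print(f"EM = {em:.1f}, token F1 = {f1:.3f}")
```

In this toy case the generated sentence is on topic, yet exact match is zero and token F1 is held down by precision, because almost every generated token falls outside the three-token label. Answers that never repeat the label's exact tokens would score zero on every metric.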


Updated 2026-05-18


References

Reference: Auditable Strict-Parity Evaluation of Prerequisite-Graph Retrieval for RAG under Leakage Controls
