Example

MOOC-CS Generation Diagnostic: EM/ROUGE-L/BLEU = 0.0, Token-F1 1.7 → 4.0 Due to Terse Bilingual Labels

On MOOC-CS, end-to-end generation diagnostics show that exact match, ROUGE-L, and BLEU remain 0.00.0, and token-level F1_1 rises only from 1.71.7 to 4.04.0 when moving from heuristic concatenation to the hierarchical baseline. The paper attributes the low scores to a reference-style artifact: the gold answers are terse bilingual concept labels, so surface-form generation metrics have very little room to score even when the retrieved context is informative. The numbers are therefore not used to argue against the retriever, only to flag that MOOC-CS references make generation metrics uninformative.

0

1

Updated 2026-05-18

Contributors are:

Who are from:

Tags

Science

Auditable Strict-Parity Evaluation of Prerequisite-Graph Retrieval for RAG under Leakage Controls