1Cademy - LectureBank-Full Generation Diagnostic: Token-F1 1.9 → 18.3, EM Stays 0.0


On LectureBank-Full, replacing heuristic concatenation of retrieved passages with the hierarchical baseline raises token-level $F_1$ from 1.9 to 18.3, while lexical exact match (EM) stays at 0.0. The $F_1$ gain means the generated answers overlap far more with the gold answer tokens, which is consistent with the hierarchical retriever supplying the generator with more of the gold evidence. The zero EM means that no generated answer matches its gold reference verbatim, so the surface form still differs even where the content overlaps. Because generation is treated only as a context-quality diagnostic, the $F_1$ improvement is read as evidence that the hierarchical context preserves more answerable evidence, not as a headline claim about generation quality.
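The gap between the two metrics is easier to see with a small sketch. The code below is not taken from the referenced evaluation. It assumes a common SQuAD-style convention (lowercasing, stripping punctuation and English articles, whitespace tokenization), and the function names and the example strings are hypothetical. It shows how a prediction can share many tokens with the gold answer and still score zero on exact match.

```python
from collections import Counter
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; the source does not specify one.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(pred, gold):
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        # Both empty counts as a match; one empty counts as a miss.
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def exact_match(pred, gold):
    """1.0 only if the normalized token sequences are identical."""
    return float(normalize(pred) == normalize(gold))


# Hypothetical example pair, invented for illustration only.
gold = "A prerequisite graph orders concepts by dependency."
pred = "The graph orders concepts so that prerequisites come first."

f1 = token_f1(pred, gold)
em = exact_match(pred, gold)
print(f"token F1 = {f1:.3f}, EM = {em:.1f}")
```

Here three of the six gold tokens appear in the eight-token prediction, so $F_1$ is well above zero while EM is 0. Averaged over a dataset, the same effect can produce a large $F_1$ gain with EM unchanged.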


Updated 2026-07-01


References

Reference: Auditable Strict-Parity Evaluation of Prerequisite-Graph Retrieval for RAG under Leakage Controls
