Example

LectureBank-Full Generation Diagnostic: Token-F1 1.9 → 18.3, EM Stays 0.0

On LectureBank-Full, the hierarchical baseline improves token-level F1 from 1.9 to 18.3 relative to heuristic concatenation of retrieved passages, but lexical exact match (EM) remains 0.0. The large F1 gain indicates that the hierarchical retriever feeds substantially more of the gold answer tokens into the generator, while the zero EM shows that the generator's output never matches the gold reference verbatim. Because generation is treated only as a context-quality diagnostic, the F1 improvement is read as evidence that the hierarchical context preserves more answerable evidence, not as a headline claim about generation quality.
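To illustrate how token-level F1 can rise while exact match stays at zero, here is a minimal sketch of both metrics. It assumes SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the source does not state which normalization the evaluation actually uses, and the function names and example strings below are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and articles, split on whitespace.

    Assumption: SQuAD-style normalization, not confirmed by the source.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 only if the normalized token sequences are identical."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: same content, different surface form.
gold = "A prerequisite graph orders concepts by dependency."
pred = "Concepts are ordered by dependency in the prerequisite graph."

# 5 shared tokens out of 8 predicted and 6 gold tokens, yet no exact match.
print(exact_match(pred, gold), round(token_f1(pred, gold), 3))
```

A paraphrased answer shares most tokens with the reference, so F1 is well above zero, but any difference in wording or order makes EM zero. This is why EM is uninformative for free-form generation and F1 serves here as the context-quality signal.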


Updated 2026-05-16


Tags

Science

Auditable Strict-Parity Evaluation of Prerequisite-Graph Retrieval for RAG under Leakage Controls