LectureBank-Full Generation Diagnostic: Token-F1 1.9 → 18.3, EM Stays 0.0
On LectureBank-Full, the hierarchical baseline improves token-level F1 from 1.9 to 18.3 relative to heuristic concatenation of retrieved passages, but lexical exact match (EM) remains 0.0. The large F1 gain indicates that the hierarchical retriever feeds substantially more of the gold answer tokens into the generator, while the zero EM shows that the generator's output never matches the gold reference verbatim. Because generation is treated only as a context-quality diagnostic, the F1 improvement is read as evidence that the hierarchical context preserves more answerable evidence, not as a headline claim about generation quality.
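To make the gap between the two metrics concrete, here is a minimal sketch of how token-level F1 can be well above zero while exact match stays at zero. The source does not say how answers are tokenized or normalized, so the sketch assumes lowercased whitespace tokenization in the style of common QA scoring scripts. The function names (`tokenize`, `exact_match`, `token_f1`) and the example strings are hypothetical, and scores are returned on a 0 to 1 scale rather than the 0 to 100 scale reported above.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase + whitespace split; the actual normalization is not given in the source.
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    # 1.0 only if the normalized token sequences are identical, else 0.0.
    return float(tokenize(prediction) == tokenize(reference))


def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count_pred, count_ref) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: most gold tokens are present, but the wording differs.
reference = "a prerequisite is a concept learned first"
prediction = "a prerequisite concept must be learned first"

em = exact_match(prediction, reference)
f1 = token_f1(prediction, reference)
print(f"EM={em:.1f}  token-F1={f1:.3f}")
```

A paraphrased answer shares most tokens with the reference, so F1 is high, yet a single reordered or substituted word is enough to drive EM to zero. This is why the diagnostic treats F1 as a signal about the evidence reaching the generator rather than about answer correctness.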
Tags
Science
Auditable Strict-Parity Evaluation of Prerequisite-Graph Retrieval for RAG under Leakage Controls
Related
LectureBank-Full R@10 Gain from Diffusion and Role-Aware Quotas
LectureBank-Full Configuration Used in Hierarchical Prerequisite RAG (208 Concepts, 899 Edges, 1,421 QA)
LectureBank-Full Target-Disjoint R@10 Result (n=164): Diffusion Gain Survives, Adaptive Tied
LectureBank-Full Generation Diagnostic: Token-F1 1.9 → 18.3, EM Stays 0.0
LectureBank-Full Error Taxonomy: Residual Misses Are Near-Misses Along the Local Prerequisite Graph
LectureBank-Full Paired Delta: Adaptive vs Hierarchical Baseline = +0.7 [-2.1, +3.6]
Token-Cap Comparison on LectureBank-Full: Adaptive Loses More as Cap Tightens
LectureBank-Full Tight-Budget Advantage of Adaptive Depth Gating (Mean ΔR@k = +2.13 over k∈{1,2,3,4})
LectureBank-Full ΔR@k Peaks at k=4 (+6.4 Points, CI [1.0, 11.7])
LectureBank-Full Diffusion Gain over Static Parent Expansion (~18 R@10 Points)
Bounded Held-Out Targets After Strictest Leakage Control (21 LectureBank-Full, 18 MOOC-CS)
LectureBank-Full Decomposition: Diffusion+Quotas Drive ~18 R@10 Points; Contrast Gating Adds At Most ~1 Point (Statistically Tied)
MOOC-CS Generation Diagnostic: EM/ROUGE-L/BLEU = 0.0, Token-F1 1.7 → 4.0 Due to Terse Bilingual Labels
QASC Generation Diagnostic: TF-IDF Multiple-Choice Scorer 76.8% (Hierarchical) vs 74.6% (Adaptive)