QASC Generation Diagnostic: TF-IDF Multiple-Choice Scorer 76.8% (Hierarchical) vs 74.6% (Adaptive)
On QASC, a deterministic TF-IDF multiple-choice scorer (no LLM judge) serves as a boundary-condition generation diagnostic. With the hierarchical baseline's retrieved context the scorer reaches 76.8% accuracy, and with the adaptive retrieved context it reaches 74.6%, which is 2.2 percentage points lower. The QASC generation check is interpreted only as a diagnostic of whether retrieved contexts preserve answerable evidence, not as a headline generation claim. Using a deterministic TF-IDF scorer keeps the diagnostic reproducible and independent of LLM-judge bias.
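To illustrate what a deterministic TF-IDF multiple-choice scorer of this kind might look like, here is a minimal sketch. It is not the evaluation code behind the reported numbers. The tokenizer, the smoothed IDF formula, the cosine scoring of "question + choice" against the retrieved passages, and all function names (`tokenize`, `build_idf`, `score_choices`, `predict`, `accuracy`) are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(docs):
    # Smoothed IDF over the retrieved passages (assumed formula).
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    default = math.log(1 + n) + 1.0  # weight for terms unseen in any passage
    return idf, default


def tfidf(tokens, idf, default):
    counts = Counter(tokens)
    return {t: c * idf.get(t, default) for t, c in counts.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_choices(question, choices, passages):
    # Score each choice by similarity of "question + choice" to the context.
    docs = [tokenize(p) for p in passages]
    idf, default = build_idf(docs)
    context_vec = tfidf([t for d in docs for t in d], idf, default)
    return [
        cosine(tfidf(tokenize(question + " " + c), idf, default), context_vec)
        for c in choices
    ]


def predict(question, choices, passages):
    scores = score_choices(question, choices, passages)
    # max() returns the first maximum, so ties go to the lowest index.
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples):
    # Each example: (question, choices, passages, gold_index).
    correct = sum(predict(q, c, p) == gold for q, c, p, gold in examples)
    return correct / len(examples)


if __name__ == "__main__":
    # Toy data, invented for illustration; not taken from QASC.
    question = "What do plants need to make food?"
    choices = ["sunlight", "rocks", "plastic"]
    passages = [
        "Plants use sunlight to make food through photosynthesis.",
        "Photosynthesis requires sunlight and water.",
    ]
    print(predict(question, choices, passages))
```

Because the scorer has no sampling and no learned parameters, running it twice on the same retrieved contexts yields the same predictions. Any accuracy difference between two retrieval strategies then reflects only the contexts they return.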
Tags
Science
Auditable Strict-Parity Evaluation of Prerequisite-Graph Retrieval for RAG under Leakage Controls
Related
QASC Directed Science Fact Graph Reconstruction (16,444 Nodes, 25,590 Edges)
QASC Strict-Parity Result: ColBERTv2/RePlug Strongest (R@10 = 85.0 [83.4, 86.6])
QASC Generation Diagnostic: TF-IDF Multiple-Choice Scorer 76.8% (Hierarchical) vs 74.6% (Adaptive)
QASC Conclusion: Reranking Beats Hierarchical and Adaptive Graph Traversal
QASC Paired Delta: Adaptive vs Hierarchical Baseline = +0.5 [-0.5, +1.5]
LectureBank-Full Generation Diagnostic: Token-F1 1.9 → 18.3, EM Stays 0.0
MOOC-CS Generation Diagnostic: EM/ROUGE-L/BLEU = 0.0, Token-F1 1.7 → 4.0 Due to Terse Bilingual Labels