Concept

Bounded Benchmark Validity: Two Question Families and 21/18 Unique Held-Out Targets Cap Statistical Power

Even with stricter leakage controls in place, the benchmark validity of this paper's headline retrieval claims is explicitly bounded. The prerequisite datasets instantiate only two question families, and the target-concept-disjoint test splits expose only 21 unique held-out targets on LectureBank-Full and 18 on MOOC-CS. With so few unique targets, paired-bootstrap intervals over per-target stability are necessarily wide, and the statistical power to distinguish adaptive from fixed-depth hierarchical retrieval at R@10 is limited. The evidence therefore applies to curated, template-based prerequisite QA, not to broad educational or open-domain graph retrieval.
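To make the interval-width argument concrete, the following is a minimal sketch of a percentile paired bootstrap that resamples whole targets. It is not the paper's evaluation code: the function name `paired_bootstrap_ci`, the resampling settings, and the per-target hit values are all hypothetical, and the scores are synthetic placeholders chosen only to match the count of 21 held-out targets.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (a - b).

    Each index is one held-out target; resampling whole targets keeps the
    pairing between the two systems intact.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        # Draw n targets with replacement and average their differences.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# SYNTHETIC per-target R@10 hits (1 = hit, 0 = miss) for 21 targets.
# These values are illustrative assumptions, not results from the paper.
adaptive = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1]
fixed = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1]

lo_21, hi_21 = paired_bootstrap_ci(adaptive, fixed)
# Same difference pattern tiled to 210 targets, to show the effect of n alone.
lo_210, hi_210 = paired_bootstrap_ci(adaptive * 10, fixed * 10)

print(f"n=21  : CI width {hi_21 - lo_21:.3f}")
print(f"n=210 : CI width {hi_210 - lo_210:.3f}")
```

Under these assumptions, the interval at 21 targets straddles zero and is wider than the one obtained from the same difference pattern tiled to ten times as many targets. This mirrors the point above: with 21 or 18 unique targets, small differences between adaptive and fixed-depth retrieval are hard to resolve.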

Updated 2026-05-18

Tags

Science

Auditable Strict-Parity Evaluation of Prerequisite-Graph Retrieval for RAG under Leakage Controls