Paired Bootstrap Resampling for Significance Testing
Paired bootstrap resampling adapts general bootstrap resampling to compare two systems evaluated on the same set of items (e.g., the same test questions). Given paired per-item scores for the two systems under the same metric, one draws a large number of resamples by sampling item indices with replacement; on each resample, both systems' metrics are recomputed on exactly the same resampled indices and a paired delta (the difference between the two metric values) is recorded. The empirical distribution of these deltas is used to form a percentile confidence interval for the true paired difference (e.g., the 2.5th and 97.5th percentiles for a 95% CI) and to derive a one-sided p-value as the fraction of resamples in which the delta has the opposite sign from the observed delta. A delta is treated as significant when its percentile CI excludes 0. Because both systems share each resampled index set, item-level variation in difficulty affects both systems equally and largely cancels in the delta; this makes paired bootstrap a standard significance protocol for comparing two systems on a shared evaluation set in NLP.
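The procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `paired_bootstrap` and `percentile`, the default of 2000 resamples, and the fixed seed are all assumptions made for the example. It also assumes the metric is the mean of per-item scores; a corpus-level metric would need to be recomputed from the resampled items instead.

```python
import random


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Paired bootstrap for the difference in mean per-item score (A minus B).

    Assumption: the metric is the mean of per-item scores. The names and
    defaults here are hypothetical, chosen only for illustration.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("scores must be non-empty and paired item by item")
    n = len(scores_a)
    rng = random.Random(seed)
    observed = sum(scores_a) / n - sum(scores_b) / n

    deltas = []
    for _ in range(n_resamples):
        # One index set per resample, shared by both systems (the pairing).
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        deltas.append(mean_a - mean_b)
    deltas.sort()

    ci_low = percentile(deltas, 100 * alpha / 2)
    ci_high = percentile(deltas, 100 * (1 - alpha / 2))

    # One-sided p-value: fraction of resamples whose delta has the opposite
    # sign from the observed delta. Counting exact zeros as "opposite" is a
    # conservative choice made here, not something the text specifies.
    if observed > 0:
        p_value = sum(d <= 0 for d in deltas) / n_resamples
    elif observed < 0:
        p_value = sum(d >= 0 for d in deltas) / n_resamples
    else:
        p_value = 1.0

    return {
        "observed_delta": observed,
        "ci": (ci_low, ci_high),
        "p_value": p_value,
        "significant": ci_low > 0 or ci_high < 0,  # CI excludes 0
    }
```

Calling `paired_bootstrap(scores_a, scores_b)` with two equal-length lists of per-item scores returns the observed delta, the percentile CI, the one-sided p-value, and the significance decision. Fixing the seed keeps the result reproducible across runs.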
Tags
Data Science
Auditable Strict-Parity Evaluation of Prerequisite-Graph Retrieval for RAG under Leakage Controls
Science