Activity (Process)

Paired Bootstrap Resampling for Significance Testing

Paired bootstrap resampling adapts general bootstrap resampling to compare two systems evaluated on the same set of items (e.g., the same test questions). Given $n$ paired per-item scores $(a_i, b_i)$ for systems $A$ and $B$ under the same metric, one draws $B$ resamples (here $B$ is the number of resamples, not the system) by sampling item indices with replacement. On each resample $b$, both systems' metrics are recomputed on exactly the same resampled indices and a paired delta $\hat{\Delta}_b = M_A^{(b)} - M_B^{(b)}$ is recorded. The empirical distribution of $\{\hat{\Delta}_b\}_{b=1}^{B}$ is used to form a percentile confidence interval for the true paired difference (e.g., the $2.5$th and $97.5$th percentiles for a 95% CI) and to derive a one-sided $p$-value as the fraction of resamples in which the delta has the opposite sign to the observed difference. A delta is treated as significant when its percentile CI excludes $0$. Using the same resampled index set for both systems removes between-item variance from the comparison, and the procedure is a widely used significance protocol for comparing two systems on a shared evaluation set in NLP.
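
The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the metric is the mean of per-item scores, uses a simple nearest-rank percentile, and counts resamples with a delta of exactly zero toward the opposite sign. The function name `paired_bootstrap` and its parameters (`n_resamples`, `alpha`, `seed`) are hypothetical and chosen only for this example.

```python
import random


def paired_bootstrap(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Paired bootstrap on mean per-item score (A minus B).

    Assumption: the metric M is the mean of per-item scores. For other
    metrics, replace the two mean computations in the loop.
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    n = len(a)
    rng = random.Random(seed)
    observed = sum(a) / n - sum(b) / n

    deltas = []
    for _ in range(n_resamples):
        # One index set per resample, shared by both systems (the pairing).
        idx = [rng.randrange(n) for _ in range(n)]
        m_a = sum(a[i] for i in idx) / n
        m_b = sum(b[i] for i in idx) / n
        deltas.append(m_a - m_b)

    # Percentile CI via nearest-rank lookup on the sorted deltas.
    deltas.sort()
    lo = deltas[round((alpha / 2) * (n_resamples - 1))]
    hi = deltas[round((1 - alpha / 2) * (n_resamples - 1))]

    # One-sided p-value: share of resamples whose delta does not have the
    # same sign as the observed difference (zeros count against it).
    if observed > 0:
        p_value = sum(d <= 0 for d in deltas) / n_resamples
    elif observed < 0:
        p_value = sum(d >= 0 for d in deltas) / n_resamples
    else:
        p_value = 1.0

    return {
        "observed": observed,
        "ci": (lo, hi),
        "p_value": p_value,
        "significant": lo > 0 or hi < 0,  # CI excludes 0
    }


if __name__ == "__main__":
    # Made-up per-item scores, for illustration only.
    scores_a = [0.8, 0.6, 0.9, 0.7, 0.5, 0.9, 0.8, 0.6]
    scores_b = [0.7, 0.6, 0.8, 0.5, 0.5, 0.8, 0.7, 0.6]
    print(paired_bootstrap(scores_a, scores_b))
```

Fixing `seed` makes the resamples reproducible, which matters when the result is reported or audited. For corpus-level metrics that are not simple means of per-item scores, the metric would need to be recomputed from the resampled items themselves.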

Updated 2026-05-18

Tags

Data Science

Auditable Strict-Parity Evaluation of Prerequisite-Graph Retrieval for RAG under Leakage Controls

Science