1Cademy - Paired Bootstrap Resampling for Significance Testing

Learn Before

Bootstrap resampling algorithm

Activity (Process)

Paired Bootstrap Resampling for Significance Testing

Paired bootstrap resampling adapts general bootstrap resampling to compare two systems evaluated on the same set of items (e.g., the same test questions). Given $n$ paired per-item scores $(a_i, b_i)$ for systems $A$ and $B$ under the same metric, one draws $R$ resamples, each formed by sampling $n$ item indices with replacement. On each resample $r$, both systems' metrics are recomputed on exactly the same resampled indices and a paired delta $\hat{\Delta}_r = M_A^{(r)} - M_B^{(r)}$ is recorded. The empirical distribution of $\{\hat{\Delta}_r\}_{r=1}^{R}$ is used to form a percentile confidence interval for the true paired difference (e.g., the 2.5th and 97.5th percentiles for a 95% CI) and to derive a one-sided $p$-value as the fraction of resamples in which the delta has the opposite sign to the delta observed on the full evaluation set. A delta is treated as significant when its percentile CI excludes $0$. Because both systems are scored on the same resampled index set, between-item variance cancels out of the comparison; this is a standard significance protocol for comparing two systems on a shared evaluation set in NLP.
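The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes the metric is the mean of per-item scores; for a corpus-level metric such as BLEU, the metric would instead be recomputed on each resampled item set. The function name `paired_bootstrap`, its parameters, and the nearest-rank percentile rule are choices made for this example only.

```python
import random
from statistics import mean


def paired_bootstrap(scores_a, scores_b, num_resamples=10_000, alpha=0.05, seed=0):
    """Paired bootstrap for the difference in mean per-item score (A minus B).

    Assumption: the metric is a simple mean over items.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired: same items, same order")
    n = len(scores_a)
    rng = random.Random(seed)

    # Delta observed on the full evaluation set.
    observed = mean(scores_a) - mean(scores_b)

    deltas = []
    for _ in range(num_resamples):
        # One draw of indices, shared by both systems: this is the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    deltas.sort()

    # Percentile confidence interval (nearest-rank on the sorted deltas).
    lo = deltas[int((alpha / 2) * (num_resamples - 1))]
    hi = deltas[int((1 - alpha / 2) * (num_resamples - 1))]

    # One-sided p-value: fraction of resamples whose delta does not share
    # the sign of the observed delta (zeros are counted against it).
    if observed >= 0:
        p_value = sum(d <= 0 for d in deltas) / num_resamples
    else:
        p_value = sum(d >= 0 for d in deltas) / num_resamples

    return {
        "observed": observed,
        "ci": (lo, hi),
        "p_value": p_value,
        "significant": not (lo <= 0 <= hi),
    }
```

Calling `paired_bootstrap(scores_a, scores_b)` with two equal-length lists of per-item scores returns the observed delta, the percentile interval, the one-sided $p$-value, and whether the interval excludes $0$. Fixing `seed` makes the result reproducible across runs.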

Updated 2026-05-18

References

Reference: Statistical Significance Tests for Machine Translation Evaluation
