1Cademy - Reproducible IR Benchmarking and Evaluation-Variance Control

Reproducible IR Benchmarking and Evaluation-Variance Control

Reproducible IR benchmarking and evaluation-variance control is a methodological tradition in information retrieval and adjacent ML evaluation that treats how a result is produced and reported as part of the result itself. Four influential strands define this tradition.

(i) Reproducible reference implementations and statistical analyses on shared benchmarks: Kamalloo et al. (SIGIR 2024) release reference dense and sparse retrievers on BEIR, together with effect-size meta-analyses and an official leaderboard, so that compared systems share the same artifacts and statistical conventions.

(ii) Improved reporting of experimental results: Dodge et al. (EMNLP 2019) show that test-set numbers alone are insufficient and recommend reporting expected validation performance as a function of computation budget, since the conclusion about which model is best can reverse depending on how much hyperparameter search each model received.

(iii) Split design as a controlled variable: Gorman and Bedrick (ACL 2019) show that system rankings obtained on a single standard train/test split may fail to hold on other splits, and argue for randomly generated splits to test whether reported gains reproduce across resamplings.

(iv) Hidden benchmark assumptions: Dehghani et al. (2021) formalize the "benchmark lottery", demonstrating that the choice of benchmark tasks (and other unstated reporting choices) can change the apparent ordering of algorithms.

Together these works establish that retrieval and ML claims must be reported with enough artifact, split, and configuration detail for variance and benchmark choices to be audited. This is the reporting practice that downstream claim-level traceability discipline extends.
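As a rough illustration of strands (ii) and (iii), the sketch below shows one way the two quantities could be computed. It is not taken from the cited authors' released code. The function names, the `evaluate_a` / `evaluate_b` callables, and the default parameters are all hypothetical. The first function treats the observed validation scores as an empirical distribution and computes the expected best score after a given number of trials, drawn i.i.d. with replacement. The second assumes each system can be trained and scored on an arbitrary index split and counts how often one system beats the other across random splits.

```python
import random
from typing import Callable, List, Sequence


def expected_max_validation(scores: Sequence[float], budget: int) -> float:
    """Expected best validation score after `budget` hyperparameter trials.

    Assumption: trials are i.i.d. draws (with replacement) from the
    empirical distribution of the observed `scores`.
    """
    if budget < 1 or not scores:
        raise ValueError("need at least one score and a budget >= 1")
    ordered = sorted(scores)
    n = len(ordered)
    total = 0.0
    for i, value in enumerate(ordered, start=1):
        # P(max equals the i-th smallest score)
        #   = (i/n)^budget - ((i-1)/n)^budget
        total += value * ((i / n) ** budget - ((i - 1) / n) ** budget)
    return total


# Hypothetical signature: takes (train_indices, test_indices), returns a metric.
Evaluator = Callable[[List[int], List[int]], float]


def split_win_rate(
    n_items: int,
    evaluate_a: Evaluator,
    evaluate_b: Evaluator,
    n_splits: int = 20,
    test_fraction: float = 0.2,
    seed: int = 0,
) -> float:
    """Fraction of random train/test splits on which system A beats system B."""
    rng = random.Random(seed)  # fixed seed so the splits can be regenerated
    indices = list(range(n_items))
    n_test = max(1, int(n_items * test_fraction))
    wins = 0
    for _ in range(n_splits):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        if evaluate_a(train, test) > evaluate_b(train, test):
            wins += 1
    return wins / n_splits
```

Plotting `expected_max_validation(scores, b)` for `b = 1, 2, ...` gives a budget curve per model. If two models' curves cross, the "best model" claim depends on search effort. A `split_win_rate` well below 1.0 suggests that a ranking observed on one standard split may not be stable. Reporting the seed and number of splits alongside the result keeps the resampling itself auditable.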