$$\delta (x)$$ is the performance difference between classifiers A and B on a particular test set x, where performance is measured by a metric such as accuracy or F-measure. A large δ(x) suggests that A is much better than B on x; a small δ(x) suggests that A is only slightly better.
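As a minimal sketch of this definition, the snippet below computes δ(x) as an accuracy difference on one small test set. The helper names `accuracy` and `delta` and the toy labels are illustrative assumptions, not part of the source.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def delta(preds_a, preds_b, gold):
    """delta(x): accuracy of A minus accuracy of B on the same test set x."""
    return accuracy(preds_a, gold) - accuracy(preds_b, gold)


# Toy test set x (hypothetical data, for illustration only).
gold    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
preds_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # A is wrong on 1 of 10 examples
preds_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]  # B is wrong on 3 of 10 examples

print(round(delta(preds_a, preds_b, gold), 2))
```

Here A scores 0.9 and B scores 0.7, so δ(x) is about 0.2. The same function would work with any other metric, such as F-measure, swapped in for `accuracy`.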

$$\delta (X)$$ is the performance difference between A and B on X, a random variable that ranges over all test sets.

$$H_{0}$$ is the null hypothesis: A is not better than B, and any difference we observe is due to chance.

The p-value is the probability, assuming the null hypothesis $$H_{0}$$ is true, of observing a difference at least as large as the δ(x) that we actually saw.


University of Colorado at Boulder

To compare the performance of classifiers A and B, we compute the p-value:

$$\text{p-value}(x) = P\left( \delta(X) \geq \delta(x) \mid H_{0} \text{ is true} \right)$$

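A very small p-value (below a chosen threshold, commonly 0.05 or 0.01) means that a difference this large would be unlikely if the null hypothesis were true, so we can reject the null hypothesis. In that case, the result that A is better than B is said to be statistically significant.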
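The text does not say how the probability is computed in practice. One common option is an approximate randomization test, sketched below under stated assumptions: each system's output is reduced to a per-example 0/1 correctness score, and under $$H_{0}$$ the labels "A" and "B" are treated as interchangeable on every example. The function name `randomization_p_value` and its parameters are hypothetical.

```python
import random


def randomization_p_value(correct_a, correct_b, num_trials=10000, seed=0):
    """Estimate P(delta(X) >= delta(x) | H0) by randomly swapping A and B.

    correct_a, correct_b: per-example 0/1 correctness of A and B on the
    same test set x. Returns (observed delta, estimated one-sided p-value).
    """
    rng = random.Random(seed)
    n = len(correct_a)
    # Per-example difference in correctness: +1, 0, or -1.
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed_sum = sum(diffs)

    at_least_as_large = 0
    for _ in range(num_trials):
        # Swapping A and B on one example flips the sign of its difference.
        shuffled_sum = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if shuffled_sum >= observed_sum:
            at_least_as_large += 1

    # Add-one smoothing keeps the estimate strictly above zero.
    p_value = (at_least_as_large + 1) / (num_trials + 1)
    return observed_sum / n, p_value


# Hypothetical correctness vectors for 20 test examples.
correct_a = [1] * 18 + [0] * 2
correct_b = [1] * 10 + [0] * 10
observed, p = randomization_p_value(correct_a, correct_b)
print(f"delta(x) = {observed:.2f}, reject H0 at 0.05: {p < 0.05}")
```

The comparison uses integer sums rather than averaged floats so that ties with the observed difference are counted exactly. Other tests, such as the paired bootstrap, estimate the same probability in a different way.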



Formula of Statistical Significance Tests

A helpful NLP textbook, still a work in progress:
https://web.stanford.edu/~jurafsky/slp3/
