Testing the Tests: Simulation of Rankings to Compare Statistical Significance Tests in Information Retrieval Evaluation
Null Hypothesis Significance Testing (NHST) has long been the reference framework for assessing differences in performance between Information Retrieval (IR) systems. IR practitioners customarily apply significance tests such as the $t$-test, the Wilcoxon Signed-Rank test, the Permutation test, the Sign test, or the Bootstrap test. However, which of these tests is the most reliable in IR experimentation remains controversial: several authors have tried to shed light on this issue, but their conclusions disagree. In this paper, we present a new methodology for assessing the behavior of significance tests in typical ranking tasks. Our method builds models from the search systems and uses those models to simulate inputs to the significance tests. This approach lets us control the experimental conditions and run experiments with full knowledge of the truth or falsity of the null hypothesis. Following this methodology, we ran a series of simulations that estimate the proportion of Type I and Type II errors made by each test. The results conclusively suggest that the Wilcoxon test is the most reliable test and, thus, IR practitioners should adopt it as the reference tool for assessing differences between IR systems.
Keywords: Information Retrieval, Statistical Significance Tests, Evaluation
Publication: Conference
June 18, 2021
/research/publications/testing-the-tests-simulation-of-rankings-to-compare-statistical-significance-tests-in-information-retrieval-evaluation
Javier Parapar, David E. Losada, Álvaro Barreiro
DOI: 10.1145/3412841.3441945
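The core idea of the abstract can be illustrated with a small, self-contained simulation: draw paired per-topic scores for two hypothetical systems, once under a true null hypothesis (identical score distributions) and once under a false one (a small systematic shift), and count how often each test rejects at a given significance level. The sketch below is a minimal illustration, not the authors' method: the topic count, the Beta score distributions, and the effect size are all assumptions made here for demonstration, not the models fitted to real systems in the paper, and the bootstrap variant is omitted for brevity.

```python
# Minimal sketch of the kind of simulation the abstract describes:
# simulate paired per-topic scores with full knowledge of whether H0
# is true, and estimate each test's Type I error rate and power.
# All parameters below (topic count, Beta distributions, effect size)
# are illustrative assumptions, not the paper's fitted system models.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N_TOPICS = 50     # paired per-topic scores (e.g., AP) per experiment
N_RUNS = 500      # simulated experiments per condition
ALPHA = 0.05
EFFECT = 0.03     # systematic shift applied when H0 is false (assumed)

def rejection_rates(shift):
    """Fraction of N_RUNS experiments in which each test rejects H0."""
    rejects = {"t-test": 0, "Wilcoxon": 0, "Sign": 0, "Permutation": 0}
    for _ in range(N_RUNS):
        a = rng.beta(2, 5, N_TOPICS)                          # system A
        b = np.clip(rng.beta(2, 5, N_TOPICS) + shift, 0, 1)   # system B
        d = b - a
        if stats.ttest_rel(b, a).pvalue < ALPHA:
            rejects["t-test"] += 1
        if stats.wilcoxon(b, a).pvalue < ALPHA:
            rejects["Wilcoxon"] += 1
        # Sign test: binomial test on the count of positive differences.
        if stats.binomtest(int(np.sum(d > 0)),
                           int(np.sum(d != 0))).pvalue < ALPHA:
            rejects["Sign"] += 1
        # Paired permutation test: randomly flips the sign of each pair.
        perm = stats.permutation_test(
            (b, a), lambda x, y: np.mean(x - y),
            permutation_type="samples", n_resamples=499, random_state=rng)
        if perm.pvalue < ALPHA:
            rejects["Permutation"] += 1
    return {name: n / N_RUNS for name, n in rejects.items()}

print("Type I error rate (H0 true): ", rejection_rates(shift=0.0))
print("Power, 1 - Type II (H0 false):", rejection_rates(shift=EFFECT))
```

Under the true null, a well-behaved test should reject close to ALPHA of the time; under the shifted condition, higher rejection rates mean fewer Type II errors. The paper's contribution is to drive such simulations with models built from real search systems rather than with arbitrary distributions like the ones assumed here.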