The empirical nature of Information Retrieval (IR) mandates strong experimental practices. A keystone of these practices is the Cranfield evaluation paradigm. Within this paradigm, the collection of relevance judgments has been the subject of intense scientific investigation: on the one hand, consistent, precise, and numerous judgments are key to reducing evaluation uncertainty and test collection bias; on the other hand, relevance judgments are costly to collect. The selection of which documents to judge for relevance, known as the pooling method, therefore has a great impact on IR evaluation. In this paper we focus on the bias introduced by the pooling method, known as pool bias, which affects the reusability of test collections, in particular when test collections are built with a limited budget. We formalize and evaluate a set of 22 pooling strategies based on traditional strategies, voting systems, retrieval fusion methods, evaluation measures, and multi-armed bandit models. To this end, we run a large-scale evaluation on a set of 9 standard TREC test collections, in which we show that the choice of pooling strategy has significant effects on the cost needed to obtain an unbiased test collection. We also identify the least biased pooling strategy according to three IR evaluation measures: AP, NDCG, and P@10.
Keywords: Pooling Method, Test Collections, Pool Bias