In this paper we study how to prioritize relevance assessments in the process of
creating an Information Retrieval test collection. A test collection consists of a set
of queries, a document collection, and a set of relevance assessments. For each
query, only a sample of documents from the collection can be manually assessed
for relevance. Multiple retrieval strategies are typically used to obtain such sample
of documents. And rank fusion plays a fundamental role in creating the sample by
combining multiple search results. We propose effective rank fusion models that
are adapted to the characteristics of this evaluation task. Our models are based on
the distribution of retrieval scores supplied by the search systems, and our experiments
show that this formal approach leads to natural and competitive solutions
when compared to state-of-the-art methods. We also demonstrate the benefits of
incorporating pseudo-relevance evidence into the estimation of the score distribution
models.
Keywords: Rank Fusion, Information Retrieval, Evaluation, Pooling, Score Distributions, Pseudo-relevance