Evaluation is crucial in Information Retrieval. The Cranfield
paradigm enables reproducible system evaluation by fostering
the construction of standard, reusable benchmarks.
Each benchmark, or test collection, comprises a set of queries,
a collection of documents, and a set of relevance judgements.
Relevance judgements are typically made by human assessors and are
therefore expensive to obtain; consequently, relevance judgements
are customarily incomplete. Only a subset of the collection,
the pool, is judged for relevance. In TREC-like campaigns,
the pool is formed by the top retrieved documents supplied
by the systems participating in a given evaluation task. With
multiple retrieval systems contributing to the pool, an exploration/exploitation
trade-off arises naturally. Exploiting
effective systems could uncover more relevant documents, but
exploring weaker systems might also be valuable for the
overall judgement process. In this paper, we cast document
judging as a multi-armed bandit problem. This formal modelling
leads to theoretically grounded adjudication strategies
that improve over the state of the art. We show that simple
instantiations of multi-armed bandit models are superior to
all previous adjudication strategies.
Keywords: Information Retrieval, Evaluation, Pooling, Reinforcement Learning, Multi-armed bandits
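To make the bandit framing concrete, the following is a minimal illustrative sketch (not the paper's actual strategy): each contributing run is treated as an arm, pulling an arm spends one relevance judgement on that run's next unjudged document, and the reward is 1 if the document turns out to be relevant. The epsilon-greedy policy, the `oracle` callable standing in for the human assessor, and all names below are assumptions introduced for illustration only.

```python
import random

def adjudicate(runs, budget, oracle, epsilon=0.1):
    """Epsilon-greedy sketch of bandit-based pool adjudication.

    runs   : list of ranked document lists, one per contributing system (arm).
    budget : total number of relevance judgements we can afford.
    oracle : hypothetical callable oracle(doc) -> bool, standing in for the assessor.
    """
    judged = {}                      # doc id -> relevance (shared across arms)
    hits = [0.0] * len(runs)         # relevant documents credited to each arm
    pulls = [0] * len(runs)          # judgements charged to each arm
    cursors = [0] * len(runs)        # next rank position to inspect per arm

    def next_unjudged(i):
        # Walk down run i past documents already judged via other runs.
        while cursors[i] < len(runs[i]) and runs[i][cursors[i]] in judged:
            cursors[i] += 1
        return runs[i][cursors[i]] if cursors[i] < len(runs[i]) else None

    for _ in range(budget):
        candidates = [i for i in range(len(runs)) if next_unjudged(i) is not None]
        if not candidates:
            break
        if random.random() < epsilon:
            # Explore: judge a document from a randomly chosen run.
            arm = random.choice(candidates)
        else:
            # Exploit: pick the run with the best empirical precision so far;
            # unpulled arms get an optimistic value so each is tried at least once.
            arm = max(candidates,
                      key=lambda i: hits[i] / pulls[i] if pulls[i] else 1.0)
        doc = next_unjudged(arm)
        rel = oracle(doc)            # one human judgement is spent here
        judged[doc] = rel
        pulls[arm] += 1
        hits[arm] += 1 if rel else 0

    return judged
```

The theoretically grounded strategies studied in the paper presumably differ in how arms are scored and selected, but the overall loop, spending one judgement per pull, updating the pulled arm's statistics, and re-ranking the arms, captures what casting document adjudication as a multi-armed bandit problem amounts to.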