Assessing multivariate Bernoulli models for Information Retrieval

Although the seminal proposal to introduce language modeling in Information Retrieval was based on a multivariate Bernoulli distribution, the predominant modeling approach is now centered on multinomial models. Language modeling for retrieval based on multivariate Bernoulli distributions is seen as inefficient and is believed to be less effective than the multinomial model. In this paper, we examine the multivariate Bernoulli model with respect to its successor and consider its role in future retrieval systems. In the context of Bayesian learning, both modeling approaches are described, contrasted, and compared, theoretically and computationally. We show that a query likelihood following a multivariate Bernoulli distribution introduces interesting retrieval features which may be useful for specific retrieval tasks such as sentence retrieval. We then address the efficiency aspect and show that algorithms can be designed to perform retrieval efficiently for multivariate Bernoulli models, before conducting an empirical comparison to study the behavioral aspects of the models. A series of comparisons on a number of test collections and retrieval tasks then determines the empirical and practical differences between the models. Our results indicate that for sentence retrieval the multivariate Bernoulli model can significantly outperform the multinomial model. However, for the other tasks the multinomial model provides consistently better performance (and in most cases significantly so). An analysis of the retrieval characteristics reveals that the multivariate Bernoulli model tends to promote long documents whose non-query terms are specific. This behavior appears detrimental to document retrieval but valuable for other tasks such as sentence retrieval.
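The contrast between the two query likelihoods can be made concrete. In the standard formulations (notation introduced here for illustration, not taken from the abstract: $Q$ the query, $V$ the vocabulary, $\theta_D$ the document language model, $c(t,Q)$ the count of term $t$ in $Q$), the multinomial likelihood is a product over query terms only, whereas the multivariate Bernoulli likelihood ranges over the whole vocabulary:

```latex
% Multinomial query likelihood: product over the terms occurring in the query
P(Q \mid \theta_D) = \prod_{t \in Q} P(t \mid \theta_D)^{c(t,Q)}

% Multivariate Bernoulli query likelihood: product over the entire vocabulary V,
% where \delta(t \in Q) = 1 if term t appears in Q and 0 otherwise
P(Q \mid \theta_D) = \prod_{t \in V} P(t \mid \theta_D)^{\delta(t \in Q)}
                     \bigl(1 - P(t \mid \theta_D)\bigr)^{1 - \delta(t \in Q)}
```

The $(1 - P(t \mid \theta_D))$ factors over non-query terms are what make the naive computation expensive (every vocabulary term contributes to every document score) and also what rewards documents whose non-query terms are specific, i.e. have low probability under the background model.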

keywords: