Affiliation:
1. Universidad de Santiago de Compostela, Spain
2. University of Glasgow, Scotland, UK
Abstract
Although the seminal proposal to introduce language modeling in information retrieval was based on a multivariate Bernoulli model, the predominant modeling approach is now centered on multinomial models. Language modeling for retrieval based on multivariate Bernoulli distributions is seen as inefficient and believed to be less effective than the multinomial model. In this article, we compare the multivariate Bernoulli model with its successor and examine its role in future retrieval systems. In the context of Bayesian learning, these two modeling approaches are described, contrasted, and compared both theoretically and computationally. We show that the query likelihood following a multivariate Bernoulli distribution introduces interesting retrieval features which may be useful for specific retrieval tasks such as sentence retrieval. Then, we address the efficiency aspect and show that algorithms can be designed to perform retrieval efficiently for multivariate Bernoulli models, before performing an empirical comparison to study the behavioral aspects of the models. A series of comparisons is then conducted on a number of test collections and retrieval tasks to determine the empirical and practical differences between the different models. Our results indicate that for sentence retrieval the multivariate Bernoulli model can significantly outperform the multinomial model. However, for the other tasks the multinomial model provides consistently better performance (and in most cases significantly so). An analysis of the various retrieval characteristics reveals that the multivariate Bernoulli model tends to promote long documents whose nonquery terms are informative. While this is detrimental to the task of document retrieval (documents tend to contain considerable nonquery content), it is valuable for other tasks such as sentence retrieval, where the retrieved elements are very short and focused.
Funder
Ministerio de Educación, Cultura y Deporte
Xunta de Galicia
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Science Applications; General Business, Management and Accounting; Information Systems
Cited by
25 articles.