Probabilistic document length priors for language models

Abstract

This paper addresses the issue of devising a new document prior for the language modeling (LM) approach to Information Retrieval. The prior is based on term statistics, is derived in a probabilistic fashion, and represents a novel way of accounting for document length. Furthermore, we develop a new way of combining document length priors with the query likelihood estimate, based on the risk of accepting the latter as a score. This prior is combined with a document retrieval language model that uses Jelinek-Mercer (JM) smoothing, a technique which does not take document length into account. The combination with the prior boosts retrieval performance, so that it outperforms an LM with a document-length-dependent smoothing component (Dirichlet prior) and another state-of-the-art high-performing scoring function (BM25). Improvements are significant and robust across different collections and query sizes. © 2008 Springer-Verlag Berlin Heidelberg.
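The abstract does not give the exact form of the prior or of the risk-based combination; as a rough illustration only, the sketch below shows the standard way a document length prior can be combined (additively in log space) with a Jelinek-Mercer query likelihood. The `length_prior` function, the interpolation weight `lam`, and the log-ratio form of the prior are illustrative assumptions, not the paper's derivation.

```python
import math
from collections import Counter


def jm_query_likelihood(query_terms, doc_terms, collection_tf, collection_len, lam=0.7):
    """Log query likelihood under a Jelinek-Mercer smoothed document model:
    P(t|D) = (1 - lam) * tf(t,D)/|D| + lam * cf(t)/|C|.
    The value of lam is an illustrative choice, not taken from the paper."""
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_doc = doc_tf[t] / doc_len if doc_len else 0.0
        p_coll = collection_tf.get(t, 0) / collection_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p > 0:
            score += math.log(p)
    return score


def length_prior(doc_len, avg_doc_len):
    """Hypothetical document length prior: a simple log-ratio favouring longer
    documents. The paper instead derives its prior from term statistics."""
    return math.log(doc_len / avg_doc_len) if doc_len else float("-inf")


def score(query_terms, doc_terms, collection_tf, collection_len, avg_doc_len):
    """Rank documents by log P(Q|D) + log P(D), i.e. query likelihood plus prior."""
    return (jm_query_likelihood(query_terms, doc_terms, collection_tf, collection_len)
            + length_prior(len(doc_terms), avg_doc_len))
```

Any length prior expressible as a per-document log term can be plugged in the same way; the paper's contribution lies in the specific probabilistic derivation of that prior and in the risk-based weighting of the query likelihood, neither of which is reproduced here.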

Cite

APA

Blanco, R., & Barreiro, A. (2008). Probabilistic document length priors for language models. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4956 LNCS, pp. 394–405). https://doi.org/10.1007/978-3-540-78646-7_36
