Automatic information retrieval systems have to deal with documents of varying lengths in a text collection. Document length normalization is used to retrieve documents of all lengths fairly. In this study, we observe that a normalization scheme that retrieves documents of all lengths with chances similar to their likelihood of relevance will outperform a scheme that retrieves documents with chances very different from their likelihood of relevance. We show that the retrieval probabilities for a particular normalization method deviate systematically from the relevance probabilities across different collections. We present pivoted normalization, a technique that can be used to modify any normalization function, thereby reducing the gap between the relevance and the retrieval probabilities. Having trained pivoted normalization on one collection, we can successfully use it on other (new) text collections, yielding a robust, collection-independent normalization technique. We use the idea of pivoting with the well-known cosine normalization function. We point out some shortcomings of the cosine function and present two new normalization functions: pivoted unique normalization and pivoted byte size normalization.
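The abstract describes pivoting only in words, so a small sketch may help make the idea concrete. It assumes the linear form usually associated with pivoted normalization: the original normalization factor is tilted around a pivot point, so that documents whose old factor lies above the pivot are penalized less and those below it are penalized more. The function names, the choice of the average number of unique terms as the pivot, and the slope value of 0.2 are illustrative assumptions, not details stated in the abstract.

```python
from typing import Iterable


def pivoted_norm(old_norm: float, pivot: float, slope: float) -> float:
    """Tilt an existing normalization factor around a pivot point.

    Assumed form: (1 - slope) * pivot + slope * old_norm.
    With slope < 1, factors above the pivot shrink and factors
    below the pivot grow; at the pivot the factor is unchanged.
    """
    return (1.0 - slope) * pivot + slope * old_norm


def pivoted_unique_norm(
    doc_terms: Iterable[str],
    avg_unique_terms: float,
    slope: float = 0.2,  # illustrative value, should be tuned on a training collection
) -> float:
    """Sketch of pivoted unique normalization.

    The 'old' normalization factor is the number of unique terms in the
    document; the pivot is assumed to be the collection-wide average
    number of unique terms per document.
    """
    unique_terms = len(set(doc_terms))
    return pivoted_norm(float(unique_terms), avg_unique_terms, slope)


if __name__ == "__main__":
    short_doc = "length normalization".split()
    long_doc = ("pivoted document length normalization reduces the gap "
                "between retrieval and relevance probabilities").split()
    avg_unique = 6.0  # hypothetical collection average

    # A term weight would be divided by these factors at indexing time.
    print(pivoted_unique_norm(short_doc, avg_unique))
    print(pivoted_unique_norm(long_doc, avg_unique))
```

In this sketch the slope is the only parameter that would need training, which is consistent with the claim that a setting learned on one collection can be carried over to another.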
Singhal, A., Buckley, C., & Mitra, M. (1996). Pivoted document length normalization. In SIGIR Forum (ACM Special Interest Group on Information Retrieval) (pp. 21–29). https://doi.org/10.1145/243199.243206