Yet another sorting-based solution to the reassignment of document identifiers

Abstract

Inverted files are widely used in search engines, such as Web search and library search. Previous work demonstrated that the compressed size of an inverted file can be significantly reduced by reassigning document identifiers. There are two main state-of-the-art solutions: the URL sorting-based solution, which sorts the documents alphabetically by URL, and the TSP-based solution, which treats the reassignment as a Traveling Salesman Problem. These techniques achieve good compression, but they have significant limitations with respect to URLs and data size. In this paper, we propose an efficient solution to the reassignment problem that first sorts the terms in each document by document frequency and then sorts the documents by the presence of the terms. Our approach places few restrictions on data sets and is applicable to a variety of situations. Experimental results on four public data sets show that, compared with the TSP-based approach, our approach reduces the time complexity from O(n²) to O(|D̄|·n log n) (|D̄|: average length of the n documents) while achieving a comparable compression ratio; compared with the URL sorting-based approach, our approach improves the compression ratio by up to 10.6% with approximately the same run time. © Springer-Verlag 2012.
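
The two-step idea in the abstract (sort the terms within each document by document frequency, then sort the documents by which terms they contain) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say whether terms are ordered by ascending or descending frequency, how ties are broken, or how documents are compared, so the sketch assumes most-frequent-first ordering, alphabetical tie-breaking, and a plain lexicographic comparison of each document's sorted term ranks. The function name `reassign_doc_ids` is hypothetical.

```python
from collections import Counter


def reassign_doc_ids(docs):
    """Return a list mapping each old document id to a new id.

    docs: list of iterables of terms; the list index is the old id.
    Assumptions (not stated in the abstract): terms are ranked with the
    highest document frequency first, ties are broken alphabetically,
    and documents are ordered lexicographically by their sorted ranks.
    """
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))

    # Rank 0 goes to the most frequent term (assumed direction).
    ranked_terms = sorted(df, key=lambda t: (-df[t], t))
    rank = {term: r for r, term in enumerate(ranked_terms)}

    # Step 1: represent each document by its term ranks, sorted.
    keys = [sorted(rank[t] for t in set(doc)) for doc in docs]

    # Step 2: sort documents by those keys, so documents that share
    # their most frequent terms end up with neighbouring ids.
    order = sorted(range(len(docs)), key=lambda i: keys[i])

    new_id = [0] * len(docs)
    for new, old in enumerate(order):
        new_id[old] = new
    return new_id


docs = [{"a", "b"}, {"c"}, {"a", "c"}, {"a", "b", "c"}]
mapping = reassign_doc_ids(docs)
```

Under these assumptions, each comparison in the document sort inspects at most one document's worth of term ranks, which is in line with the O(|D̄|·n log n) bound quoted above.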

Cite

Shi, L., & Wang, B. (2012). Yet another sorting-based solution to the reassignment of document identifiers. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7675 LNCS, pp. 238–249). https://doi.org/10.1007/978-3-642-35341-3_20
