Yet another sorting-based solution to the reassignment of document identifiers

Liang Shi; Bin Wang

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2012) 7675 LNCS 238-249

DOI: 10.1007/978-3-642-35341-3_20

4 Citations

2 Readers

Abstract

Inverted files are widely used in search engines, such as web search and library search. Previous work has demonstrated that the compressed size of an inverted file can be significantly reduced by reassigning document identifiers. There are two main state-of-the-art solutions: the URL sorting-based solution, which sorts the documents in alphabetical order of their URLs; and the TSP-based solution, which treats the reassignment as a Traveling Salesman Problem. These techniques achieve good compression but have significant limitations: the former depends on URLs, and the latter is limited in the data size it can handle. In this paper, we propose an efficient solution to the reassignment problem that first sorts the terms in each document by document frequency and then sorts the documents by the presence of the terms. Our approach places few restrictions on data sets and is applicable to various situations. Experimental results on four public data sets show that, compared with the TSP-based approach, our approach reduces the time complexity from O(n²) to O(|D̄|·n log n) (|D̄|: average length of the n documents) while achieving a comparable compression ratio; and, compared with the URL sorting-based approach, our approach improves the compression ratio by up to 10.6% with approximately the same run time. © Springer-Verlag 2012.
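
The two sorting steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state the sort direction, the tie-breaking rule, or how documents are compared, so the choices below (most frequent terms first, ties broken alphabetically, lexicographic comparison of each document's rank list) are assumptions, and the names `reassign_doc_ids` and `total_gap` are hypothetical helpers introduced only for this example.

```python
from collections import Counter


def reassign_doc_ids(docs):
    """Return a list `order` such that order[new_id] == old_id.

    `docs` is a list of term collections, indexed by the old document ID.
    """
    # Document frequency: number of documents containing each term.
    df = Counter()
    for terms in docs:
        df.update(set(terms))

    # Assumption: rank terms by descending document frequency,
    # breaking ties alphabetically so the result is deterministic.
    ranked_terms = sorted(df, key=lambda t: (-df[t], t))
    rank = {t: r for r, t in enumerate(ranked_terms)}

    # Step 1: sort the terms inside each document by that rank.
    keys = [sorted(rank[t] for t in set(terms)) for terms in docs]

    # Step 2: sort the documents by their rank lists (lexicographic order),
    # so documents sharing frequent terms receive nearby identifiers.
    return sorted(range(len(docs)), key=lambda i: keys[i])


def total_gap(docs, order):
    """Sum of d-gaps over all posting lists under the given ordering."""
    new_id = {old: new for new, old in enumerate(order)}
    postings = {}
    for old, terms in enumerate(docs):
        for t in set(terms):
            postings.setdefault(t, []).append(new_id[old])
    total = 0
    for ids in postings.values():
        ids.sort()
        # The first entry is stored as-is; later entries as differences.
        total += ids[0] + sum(b - a for a, b in zip(ids, ids[1:]))
    return total


if __name__ == "__main__":
    docs = [{"a", "b"}, {"c"}, {"a"}, {"a", "b", "c"}]
    order = reassign_doc_ids(docs)
    print("new order of old IDs:", order)
    print("gap sum before:", total_gap(docs, list(range(len(docs)))))
    print("gap sum after: ", total_gap(docs, order))
```

Smaller d-gaps generally compress better under gap-based posting list encodings, which is why the sketch reports the gap sum before and after reordering. Each document comparison in step 2 inspects at most as many ranks as the shorter document holds, which is in line with the O(|D̄|·n log n) bound quoted in the abstract.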

Cite

Shi, L., & Wang, B. (2012). Yet another sorting-based solution to the reassignment of document identifiers. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7675 LNCS, pp. 238–249). https://doi.org/10.1007/978-3-642-35341-3_20
