Self-indexing natural language

Abstract

Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful in reducing the size of the large indexes typically used on strings, namely suffix trees and suffix arrays. Self-indexes represent a string in space close to its compressed size and provide indexed searching on it. On natural language, a compressed inverted index over the compressed text already provides a reasonable alternative, in space and time, for indexed searching of words and phrases. In this paper we explore the possibility of regarding natural language text as a string of words and applying a self-index to it. There are several challenges involved, such as dealing with a very large alphabet and detaching searchable content from non-searchable presentation aspects in the text. Our results show that the self-index requires space very close to that of the best word-based compressors, and that it obtains better search time than inverted indexes (using the same overall space) when searching for phrases. © 2009 Springer Berlin Heidelberg.
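
The idea of regarding text as a string of words can be illustrated with a minimal sketch. It is not the paper's method: it assumes a plain, uncompressed suffix array over integer word identifiers, whereas the paper studies a compressed self-index. The function names (`tokenize`, `build_word_suffix_array`, `find_phrase`) and the choice to treat case and punctuation as non-searchable presentation are illustrative assumptions.

```python
import re


def tokenize(text):
    """Return the lowercase words of text.

    Assumption: separators and letter case are presentation only,
    so they are dropped from the searchable sequence.
    """
    return [w.lower() for w in re.findall(r"\w+", text)]


def build_word_suffix_array(words):
    """Map each word to an integer id and sort all word-level suffixes.

    The alphabet is the vocabulary, so it can be very large.
    The naive sort is fine for a sketch but is not efficient.
    """
    vocab = {w: i for i, w in enumerate(sorted(set(words)))}
    ids = [vocab[w] for w in words]
    sa = sorted(range(len(ids)), key=lambda i: ids[i:])
    return vocab, ids, sa


def find_phrase(phrase, vocab, ids, sa):
    """Return the word positions where phrase starts, in increasing order."""
    pattern = []
    for w in tokenize(phrase):
        if w not in vocab:
            return []  # an unknown word cannot occur in the text
        pattern.append(vocab[w])
    m = len(pattern)
    if m == 0:
        return []

    # First suffix whose m-word prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ids[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose m-word prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ids[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    words = tokenize("The cat sat on the mat. The cat ran!")
    vocab, ids, sa = build_word_suffix_array(words)
    print(find_phrase("the cat", vocab, ids, sa))
```

All occurrences of a phrase sit in one contiguous range of the suffix array, so two binary searches locate them without scanning a posting list per word. This is one plausible reason phrase queries suit a word-level suffix structure.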

Cite

Brisaboa, N. R., Fariña, A., Navarro, G., Places, A. S., & Rodríguez, E. (2008). Self-indexing natural language. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5280 LNCS, pp. 121–132). Springer Verlag. https://doi.org/10.1007/978-3-540-89097-3_13
