Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful in reducing the size of the large indexes typically used on strings, namely suffix trees and suffix arrays. A self-index represents a string in space close to its compressed size and supports indexed searching on it. For natural language, a compressed inverted index over the compressed text already provides a reasonable alternative, in space and time, for indexed searching of words and phrases. In this paper we explore the possibility of regarding natural language text as a string of words and applying a self-index to it. This involves several challenges, such as dealing with a very large alphabet and separating searchable content from non-searchable presentation aspects of the text. We show that the resulting self-index requires space very close to that of the best word-based compressors, and that it obtains better search times than inverted indexes (using the same overall space) when searching for phrases. © 2009 Springer Berlin Heidelberg.
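To make the idea of "text as a string of words" concrete, the following is a minimal sketch and not the method of the paper: it maps each distinct word to an integer symbol and builds a plain, uncompressed suffix array over the resulting sequence, then finds phrases by binary search. The function names (`tokenize`, `build_word_suffix_array`, `find_phrase`), the lowercasing, and the choice to discard separators are all assumptions made for illustration; an actual self-index would replace the suffix array with a compressed structure and would keep the presentation aspects elsewhere so the text can be reproduced.

```python
import re


def tokenize(text):
    """Split text into lowercase words. Separators and case are dropped here
    (an illustrative assumption), so only searchable content is indexed."""
    return re.findall(r"\w+", text.lower())


def build_word_suffix_array(text):
    """Return (vocab, ids, sa) for a word-based suffix array.

    vocab maps each distinct word to an integer symbol, ids is the text as a
    sequence of those symbols, and sa lists the word positions sorted by the
    suffix of ids starting there. Construction is naive and uncompressed.
    """
    words = tokenize(text)
    vocab = {w: i for i, w in enumerate(sorted(set(words)))}
    ids = [vocab[w] for w in words]
    sa = sorted(range(len(ids)), key=lambda i: ids[i:])
    return vocab, ids, sa


def find_phrase(phrase, vocab, ids, sa):
    """Return the sorted word positions where the phrase occurs."""
    pattern_words = tokenize(phrase)
    if not pattern_words or any(w not in vocab for w in pattern_words):
        return []
    pat = [vocab[w] for w in pattern_words]
    m = len(pat)

    # Lower bound: first suffix whose first m symbols are >= the pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ids[sa[mid]:sa[mid] + m] < pat:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m symbols are > the pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ids[sa[mid]:sa[mid] + m] <= pat:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


vocab, ids, sa = build_word_suffix_array(
    "To be or not to be, that is the question."
)
positions = find_phrase("to be", vocab, ids, sa)
```

Note that the alphabet here is the vocabulary of the text rather than a set of characters, which is why the abstract lists a very large alphabet among the challenges. A phrase of any length is located with two binary searches over one contiguous range of the suffix array, whereas an inverted index would typically intersect one posting list per word.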
Brisaboa, N. R., Fariña, A., Navarro, G., Places, A. S., & Rodríguez, E. (2008). Self-indexing natural language. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5280 LNCS, pp. 121–132). Springer Verlag. https://doi.org/10.1007/978-3-540-89097-3_13