Abstract
Signature files seem to be a promising access method for text and attributes. According to this method, the documents (or records) are stored sequentially in one file (“text file”), while abstractions of the documents (“signatures”) are stored sequentially in another file (“signature file”). In order to resolve a query, the signature file is scanned first, and many nonqualifying documents are immediately rejected. We develop a framework that includes primary key hashing, multiattribute hashing, and signature files. Our effort is to find the optimal signature extraction method. The main contribution of this paper is that we present optimal and efficient suboptimal algorithms for assigning words to signatures in several environments. Another contribution is that we use information theory, and study the relationship of the false drop probability Fd and the information that is lost during signature extraction. We give tight lower bounds on the achievable Fd and show that a simple relationship holds between the two quantities in the case of optimal signature extraction with uniform occurrence and query frequencies. We examine hashing as a method to map words to signatures (instead of the optimal way), and show that the same relationship holds between Fd and loss, indicating that an invariant may exist between these two quantities for every signature extraction method. © 1987, ACM. All rights reserved.
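As a rough illustration of the access method described above, the following sketch builds a signature file with hashing-based superimposed coding and scans it before touching the text file. It is not the optimal extraction method of the paper. The signature width, the number of bits per word, the SHA-256 hash, and all function names are assumptions chosen for the example.

```python
import hashlib

SIG_BITS = 64       # signature width in bits (illustrative assumption)
BITS_PER_WORD = 3   # bit positions derived per word (illustrative assumption)


def word_signature(word: str) -> int:
    """Hash a word to a bit pattern with up to BITS_PER_WORD bits set."""
    sig = 0
    for i in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{i}:{word.lower()}".encode()).digest()
        pos = int.from_bytes(digest[:4], "big") % SIG_BITS
        sig |= 1 << pos
    return sig


def document_signature(text: str) -> int:
    """Superimpose (bitwise OR) the signatures of all words in a document."""
    sig = 0
    for word in text.split():
        sig |= word_signature(word)
    return sig


def search(query_word, text_file, signature_file):
    """Scan the signature file first, then verify candidates in the text file.

    Returns the indices of qualifying documents and the number of false
    drops (documents whose signature matched but whose text did not).
    """
    q = word_signature(query_word)
    # A document can qualify only if its signature covers every query bit.
    candidates = [i for i, s in enumerate(signature_file) if s & q == q]
    target = query_word.lower()
    hits = [i for i in candidates if target in text_file[i].lower().split()]
    return hits, len(candidates) - len(hits)


# Hypothetical "text file" and the "signature file" derived from it.
text_file = [
    "signature files for text retrieval",
    "primary key hashing",
    "multiattribute hashing and signature files",
]
signature_file = [document_signature(doc) for doc in text_file]

hits, false_drops = search("hashing", text_file, signature_file)
```

Because a document's signature is the OR of its word signatures, a document that contains the query word always passes the signature test, so the scan never misses a qualifying document. The false drops counted in the second return value are the only errors, and they cost an unnecessary access to the text file.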
Faloutsos, C., & Christodoulakis, S. (1987). Optimal Signature Extraction and Information Loss. ACM Transactions on Database Systems (TODS), 12(3), 395–428. https://doi.org/10.1145/27629.214285