Estimating the number of substring matches in long string databases

Abstract

Estimating the number of substring matches is one of the problems in alphanumeric selectivity estimation, which relies on statistical information about strings. In this context, the CS-tree (Count Suffix Tree), a variation of the suffix tree, has been used as the basic data structure for storing statistical information about substrings. Although the CS-tree is useful for keeping information about short strings such as names or titles, it has two drawbacks: some of the count values it keeps can be incorrect, and it is almost impossible to build over long strings such as biological sequences. Therefore, for estimating the number of substring matches in long strings, we propose the CQ-tree (Count Q-gram Tree), which keeps the exact count values of all substrings of length q or less in the long strings and can be constructed in one scan of the data strings. Furthermore, on the basis of the CQ-tree, we return lower and upper bounds on the number of occurrences of a query, together with the estimated count of the query pattern. These bounds are mathematically proved. To the best of our knowledge, our work is the first in alphanumeric selectivity estimation to present such lower and upper bounds. © Springer-Verlag Berlin Heidelberg 2005.
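
The idea of keeping exact counts for every substring of length q or less, and bounding the count of a longer query from them, can be illustrated with a minimal Python sketch. It is not the CQ-tree from the paper: a flat `Counter` stands in for the tree, and the function names `build_qgram_counts` and `count_bounds` are hypothetical. The bounds shown are generic ones chosen for illustration (a pattern cannot occur more often than any of its q-grams, and occ(xyz) >= occ(xy) + occ(yz) - occ(y) for a non-empty overlap y); the abstract does not state the paper's own bounds, so these should not be read as the authors' formulas.

```python
from collections import Counter


def build_qgram_counts(strings, q):
    """Count every substring of length 1..q, visiting each start position once."""
    counts = Counter()
    for s in strings:
        for i in range(len(s)):
            # Substrings starting at i, with lengths 1..q (fewer near the end).
            for k in range(1, min(q, len(s) - i) + 1):
                counts[s[i:i + k]] += 1
    return counts


def count_bounds(counts, q, pattern):
    """Return illustrative (lower, upper) bounds on the occurrences of pattern.

    Assumption: these are generic bounds, not the ones proved in the paper.
    """
    if len(pattern) <= q:
        # Short patterns are stored exactly, so both bounds coincide.
        exact = counts[pattern]
        return exact, exact

    # All consecutive q-grams of the pattern.
    grams = [pattern[i:i + q] for i in range(len(pattern) - q + 1)]

    # Upper bound: the pattern cannot occur more often than any of its q-grams.
    upper = min(counts[g] for g in grams)

    # Lower bound: extend the pattern one character at a time using
    # occ(xyz) >= occ(xy) + occ(yz) - occ(y); needs a non-empty overlap y.
    if q < 2:
        return 0, upper
    lower = counts[grams[0]]
    for g in grams[1:]:
        overlap = g[:-1]  # the (q-1)-gram shared with the prefix seen so far
        lower = max(0, lower + counts[g] - counts[overlap])
    return lower, upper


if __name__ == "__main__":
    counts = build_qgram_counts(["abab", "abc"], q=2)
    print(count_bounds(counts, 2, "ab"))
    print(count_bounds(counts, 2, "abc"))
```

When the lower and upper bounds coincide, the count of the longer pattern is determined exactly by the stored q-gram counts; otherwise the true count lies somewhere in the returned interval.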

Cite

Bae, J., & Lee, S. (2005). Estimating the number of substring matches in long string databases. In Lecture Notes in Computer Science (Vol. 3399, pp. 145–156). Springer Verlag. https://doi.org/10.1007/978-3-540-31849-1_15
