Most significant substring mining based on chi-square measure

Sourav Dutta; Arnab Bhattacharya

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2010), vol. 6118 LNAI (Part 1), pp. 319-327

DOI: 10.1007/978-3-642-13657-3_35

8 Citations

8 Readers

Abstract

Given the vast reservoirs of sequence data stored worldwide, efficient mining of string databases from domains such as intrusion detection systems, player statistics, texts, and proteins has emerged as a great challenge. Searching for an unusual pattern within long strings of data has become a requirement for diverse applications. Given a string, the problem is to identify the substrings that differ the most from the expected or normal behavior, i.e., the substrings that are statistically significant (less likely to occur due to chance alone). To this end, we use the chi-square measure and propose two heuristics for retrieving the top-k substrings with the largest chi-square measure. We show that the algorithms outperform other competing algorithms in runtime, while maintaining a high approximation ratio of more than 0.96. © 2010 Springer-Verlag Berlin Heidelberg.
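To make the problem statement concrete, here is a minimal Python sketch of an exhaustive baseline. It is not the paper's heuristics, which the abstract does not describe. It assumes the standard Pearson chi-square statistic, with each symbol's expected count equal to the substring length times a fixed null-model probability for that symbol. The function names `chi_square` and `top_k_substrings` and the `probs` parameter are hypothetical, chosen only for illustration.

```python
import heapq
from collections import Counter


def chi_square(substring, probs):
    """Pearson chi-square of a substring against a null model.

    Assumption: probs maps every alphabet symbol to its expected
    probability, and the expected count of a symbol is len(substring) * p.
    """
    length = len(substring)
    counts = Counter(substring)
    return sum(
        (counts.get(sym, 0) - length * p) ** 2 / (length * p)
        for sym, p in probs.items()
    )


def top_k_substrings(text, probs, k):
    """Exhaustive baseline over all O(n^2) substrings.

    Returns up to k tuples (score, start, end), highest score first,
    where text[start:end] is the substring. Every symbol in text must
    appear in probs, otherwise a KeyError is raised.
    """
    heap = []  # min-heap holding the k best candidates seen so far
    n = len(text)
    for start in range(n):
        counts = dict.fromkeys(probs, 0)
        for end in range(start, n):
            counts[text[end]] += 1  # extend the substring by one symbol
            length = end - start + 1
            score = sum(
                (counts[sym] - length * p) ** 2 / (length * p)
                for sym, p in probs.items()
            )
            item = (score, start, end + 1)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    # A fair-coin null model over a two-symbol alphabet.
    model = {"a": 0.5, "b": 0.5}
    for score, start, end in top_k_substrings("abaaaab", model, 3):
        print(round(score, 3), "abaaaab"[start:end])
```

An exhaustive scan like this examines every substring, so it is only practical for short strings; it mainly serves as a reference against which faster approximate methods could be compared.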

Citation (APA)

Dutta, S., & Bhattacharya, A. (2010). Most significant substring mining based on chi-square measure. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6118 LNAI, pp. 319–327). https://doi.org/10.1007/978-3-642-13657-3_35