Scalable detection of frequent substrings by grammar-based compression

Masaya Nakahara; Shirou Maruyama; Tetsuji Kuboyama; Hiroshi Sakamoto

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2011), Vol. 6926 LNAI, pp. 236–246

DOI: 10.1007/978-3-642-24477-3_20

Abstract

We propose a scalable method for pattern discovery by compression. A string can be represented by a context-free grammar (CFG) that derives the string deterministically. In this framework of grammar-based compression, the aim of the algorithm is to output as small a CFG as possible, and this optimization problem can be solved approximately. Among such approximation algorithms, the compressor by Sakamoto et al. (2009) is especially suitable for detecting maximal common substrings as well as long frequent substrings. This is possible thanks to the characteristics of edit-sensitive parsing (ESP) by Cormode and Muthukrishnan (2007), which was introduced to approximate a variant of edit distance. Based on ESP, we design a linear-time algorithm that approximately finds all frequent patterns in a string, and we prove a lower bound on the length of the extracted frequent patterns. We also examine the performance of our algorithm through experiments on DNA sequences and other compressible real-world texts. Compared to the practical algorithm developed by Uno (2008), our algorithm is faster on large and repetitive strings. © 2011 Springer-Verlag.
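
The general idea in the abstract, that a grammar deriving a single string exposes its repeated substrings as reused nonterminals, can be illustrated with a small sketch. This is not the authors' algorithm and does not implement ESP: it uses a naive left-aligned pairing at each level, and the names `build_grammar`, `expand`, and `frequent_substrings` are hypothetical, chosen only for this illustration. The count reported for each substring is the number of times its nonterminal appears in the parse, which is only a lower bound on its true frequency.

```python
from collections import Counter


def build_grammar(text):
    """Build a grammar deriving exactly `text` by naive level-by-level pairing.

    Assumption for illustration only: adjacent symbols are paired at
    positions (0,1), (2,3), ... on every level. This is NOT edit-sensitive
    parsing; it is the simplest parse that yields a deterministic grammar.
    """
    rules = {}          # (left, right) -> nonterminal name
    counts = Counter()  # nonterminal name -> occurrences in the parse
    level = list(text)
    next_id = 0
    while len(level) > 1:
        nxt = []
        i = 0
        while i + 1 < len(level):
            pair = (level[i], level[i + 1])
            if pair not in rules:
                rules[pair] = f"X{next_id}"
                next_id += 1
            counts[rules[pair]] += 1
            nxt.append(rules[pair])
            i += 2
        if i < len(level):
            nxt.append(level[i])  # an unpaired last symbol is carried up
        level = nxt
    start = level[0] if level else ""
    productions = {name: pair for pair, name in rules.items()}
    return start, productions, counts


def expand(symbol, productions):
    """Derive the string generated by `symbol` (terminals expand to themselves)."""
    if symbol not in productions:
        return symbol
    left, right = productions[symbol]
    return expand(left, productions) + expand(right, productions)


def frequent_substrings(text, min_count=2):
    """Report substrings whose nonterminal occurs at least `min_count` times."""
    _, productions, counts = build_grammar(text)
    found = Counter()
    for name, count in counts.items():
        if count >= min_count:
            found[expand(name, productions)] += count
    return dict(found)
```

For the input `"abababab"` this sketch reports `{"ab": 4, "abab": 2}`. For `"abcabc"` it reports nothing, because the second `abc` starts at an odd offset and the fixed pairing splits it differently from the first. Handling that kind of misalignment is the reason a parsing scheme such as ESP is used instead of fixed pairing.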

Citation (APA)

Nakahara, M., Maruyama, S., Kuboyama, T., & Sakamoto, H. (2011). Scalable detection of frequent substrings by grammar-based compression. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6926 LNAI, pp. 236–246). https://doi.org/10.1007/978-3-642-24477-3_20
