We study the problem of efficiently removing equal-frequency n-gram substrings from an n-gram set, formally called Statistical Substring Reduction (SSR). SSR is a useful operation in corpus-based multi-word unit research and in the new word identification task of oriental language processing. We present a new SSR algorithm with linear time complexity, O(n), and prove its equivalence to the traditional O(n²) algorithm. In particular, using experimental results from several corpora of different sizes, we show that it is possible to achieve performance close to that theoretically predicted for this task. Even on a small corpus, the new algorithm is several orders of magnitude faster than the O(n²) one. These results show that our algorithm is reliable and efficient, and is therefore an appropriate choice for large-scale corpus processing.
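To make the operation concrete, the following is a minimal sketch of the quadratic baseline only, not of the linear-time algorithm the abstract announces. It assumes that "equal frequency" means exact equality of counts and that n-grams are character strings; the function name `reduce_substrings` and the sample counts are hypothetical and chosen purely for illustration.

```python
def reduce_substrings(freq):
    """Naive pairwise SSR sketch (quadratic in the number of n-grams).

    An n-gram is dropped when some other n-gram in the set contains it
    as a substring and has exactly the same frequency (an assumption
    about what "equal frequency" means here).
    """
    kept = {}
    for gram, count in freq.items():
        redundant = any(
            other != gram and other_count == count and gram in other
            for other, other_count in freq.items()
        )
        if not redundant:
            kept[gram] = count
    return kept


# Hypothetical n-gram counts, not taken from any corpus in the paper.
counts = {"abc": 5, "ab": 5, "bc": 7, "c": 7, "a": 9}
reduced = reduce_substrings(counts)
print(reduced)
```

In this example "ab" is removed because it occurs inside "abc" with the same count, and "c" is removed because it occurs inside "bc" with the same count; "bc" survives because its count differs from that of "abc". Every n-gram is compared against every other one, which is where the quadratic cost comes from.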
CITATION STYLE
Yang, X., Zhou, G., Su, J., & Tan, C. L. (2005). Natural Language Processing – IJCNLP 2004. (K.-Y. Su, J. Tsujii, J.-H. Lee, & O. Y. Kwong, Eds.) (Vol. 3248, pp. 22–31). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/b105612