Natural Language Processing – IJCNLP 2004

  • Yang X
  • Zhou G
  • Su J
  • et al.

Abstract

We study the problem of efficiently removing equal-frequency n-gram substrings from an n-gram set, formally called Statistical Substring Reduction (SSR). SSR is a useful operation in corpus-based multi-word unit research and in the new-word identification task of oriental language processing. We present a new SSR algorithm that has linear time (O(n)) complexity and prove its equivalence with the traditional O(n²) algorithm. In particular, using experimental results from several corpora of different sizes, we show that it is possible to achieve performance close to that theoretically predicted for this task. Even on a small corpus, the new algorithm is several orders of magnitude faster than the O(n²) one. These results show that our algorithm is reliable and efficient, and is therefore an appropriate choice for large-scale corpus processing.
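
To make the operation concrete, the following is a minimal sketch of the quadratic baseline only, not the linear-time algorithm the abstract announces. It assumes the reading of SSR given above: an n-gram is dropped when it occurs as a contiguous run of tokens inside a longer n-gram of the same frequency. The names `naive_ssr` and `is_contiguous_sub`, and the toy counts, are hypothetical and chosen for illustration.

```python
from typing import Dict, Tuple

NGram = Tuple[str, ...]


def is_contiguous_sub(short: NGram, long: NGram) -> bool:
    """Return True if `short` occurs as a contiguous run of tokens in `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def naive_ssr(counts: Dict[NGram, int]) -> Dict[NGram, int]:
    """Quadratic baseline (assumed definition, see lead-in).

    Every n-gram is compared against every other n-gram, so the cost
    grows with the square of the set size.
    """
    kept: Dict[NGram, int] = {}
    for gram, freq in counts.items():
        # Redundant if a strictly longer n-gram with the same frequency contains it.
        redundant = any(
            len(other) > len(gram)
            and other_freq == freq
            and is_contiguous_sub(gram, other)
            for other, other_freq in counts.items()
        )
        if not redundant:
            kept[gram] = freq
    return kept


# Toy frequencies, invented for illustration only.
counts = {
    ("new",): 12,
    ("york",): 5,
    ("new", "york"): 5,
    ("new", "york", "city"): 5,
    ("city",): 9,
}
reduced = naive_ssr(counts)
```

In this toy set, `("york",)` and `("new", "york")` are removed because both occur inside `("new", "york", "city")` with the same frequency of 5, while `("new",)` and `("city",)` survive because their frequencies differ from it.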

Citation (APA)

Yang, X., Zhou, G., Su, J., & Tan, C. L. (2005). Natural Language Processing – IJCNLP 2004. (K.-Y. Su, J. Tsujii, J.-H. Lee, & O. Y. Kwong, Eds.) (Vol. 3248, pp. 22–31). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/b105612
