This paper considers the issue of frequency consolidation among lists of different-length word n-grams (i.e. recurrent word sequences) extracted from the same underlying corpus. A simple algorithm – enhanced by a preparatory stage – is proposed which consolidates frequencies across lists of different-length n-grams, from 2-grams to 6-grams and beyond. The consolidation adjusts the frequency count of each n-gram to the number of its occurrences minus its occurrences as part of longer n-grams. Among other uses, such a procedure aids linguistic analysis and allows the non-inflationary counting of word tokens that are part of frequent n-grams of various lengths, which in turn allows an assessment of the proportion of running text made up of recurring chunks. The proposed procedure delivers frequency consolidation and substring reduction among word n-grams and is independent of any particular method of n-gram extraction and filtering, making it applicable even where full access to the underlying corpora is unavailable.
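The core idea described above – reducing each n-gram's count by its occurrences inside longer n-grams – can be illustrated with a minimal sketch. This is not the paper's exact procedure (which includes a preparatory stage and operates on pre-extracted frequency lists); it is a simplified rendering of the substring-reduction step, assuming n-grams are represented as tuples of words mapped to raw corpus frequencies:

```python
def consolidate(ngram_freqs):
    """Consolidate frequencies across n-gram lists of different lengths.

    ngram_freqs: dict mapping an n-gram (a tuple of words) to its raw
    corpus frequency. Each n-gram's count is reduced by the consolidated
    counts of every longer n-gram that contains it as a contiguous
    substring, once per position at which it occurs.
    """
    consolidated = dict(ngram_freqs)
    # Process longest n-grams first: by the time an n-gram is subtracted
    # from its own substrings, its consolidated count is already final.
    for longer in sorted(ngram_freqs, key=len, reverse=True):
        n = len(longer)
        for size in range(2, n):                # strictly shorter substrings
            for i in range(n - size + 1):       # every position in `longer`
                sub = longer[i:i + size]
                if sub in consolidated:
                    consolidated[sub] -= consolidated[longer]
    return consolidated


# Illustration: 'a b' occurs 10 times in total, 3 of which lie inside
# 'a b c d' and a further 2 inside trigram-only occurrences of 'a b c'.
freqs = {('a', 'b', 'c', 'd'): 3, ('a', 'b', 'c'): 5, ('a', 'b'): 10}
result = consolidate(freqs)
# → {('a','b','c','d'): 3, ('a','b','c'): 2, ('a','b'): 5}
```

Subtracting the *consolidated* (rather than raw) counts of longer n-grams avoids double-counting when n-grams are nested at several levels, which is the non-inflationary property the abstract refers to.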
Buerki, A. (2017). Frequency consolidation among word N-grams: A practical procedure. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10596 LNAI, pp. 432–446). Springer Verlag. https://doi.org/10.1007/978-3-319-69805-2_30