This paper considers the issue of frequency consolidation among lists of different-length word n-grams (i.e. recurrent word sequences) extracted from the same underlying corpus. A simple algorithm – enhanced by a preparatory stage – is proposed which consolidates frequencies across lists of different-length n-grams, from 2-grams to 6-grams and beyond. The consolidation adjusts the frequency count of each n-gram to the number of its occurrences minus its occurrences as part of longer n-grams. Among other uses, such a procedure aids linguistic analysis and allows the non-inflationary counting of word tokens that are part of frequent n-grams of various lengths, which in turn allows an assessment of the proportion of running text made up of recurring chunks. The proposed procedure delivers frequency consolidation and substring reduction among word n-grams and is independent of any particular method of n-gram extraction and filtering, making it applicable even where full access to the underlying corpora is unavailable.
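The core idea described above – reducing each n-gram's count by its occurrences inside longer n-grams – can be illustrated with a minimal sketch. This is not the paper's exact procedure (which includes a preparatory stage and operates on pre-extracted frequency lists); it is a simplified rendering of the substring-reduction step, assuming n-grams are represented as tuples of words mapped to raw corpus frequencies:

```python
def consolidate(ngram_freqs):
    """Consolidate frequencies across n-gram lists of different lengths.

    ngram_freqs: dict mapping an n-gram (a tuple of words) to its raw
    corpus frequency. Each n-gram's count is reduced by the consolidated
    counts of every longer n-gram that contains it as a contiguous
    substring, once per position at which it occurs.
    """
    consolidated = dict(ngram_freqs)
    # Process longest n-grams first: by the time an n-gram is subtracted
    # from its own substrings, its consolidated count is already final.
    for longer in sorted(ngram_freqs, key=len, reverse=True):
        n = len(longer)
        for size in range(2, n):                # strictly shorter substrings
            for i in range(n - size + 1):       # every position in `longer`
                sub = longer[i:i + size]
                if sub in consolidated:
                    consolidated[sub] -= consolidated[longer]
    return consolidated


# Illustration: 'a b' occurs 10 times in total, 3 of which lie inside
# 'a b c d' and a further 2 inside trigram-only occurrences of 'a b c'.
freqs = {('a', 'b', 'c', 'd'): 3, ('a', 'b', 'c'): 5, ('a', 'b'): 10}
result = consolidate(freqs)
# → {('a','b','c','d'): 3, ('a','b','c'): 2, ('a','b'): 5}
```

Subtracting the *consolidated* (rather than raw) counts of longer n-grams avoids double-counting when n-grams are nested at several levels, which is the non-inflationary property the abstract refers to.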
Buerki, A. (2017). Frequency consolidation among word N-grams: A practical procedure. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10596 LNAI, pp. 432–446). Springer Verlag. https://doi.org/10.1007/978-3-319-69805-2_30