Efficient n-gram analysis in R with cmscu

David W. Vinson; Jason K. Davis; Suzanne S. Sindi; Rick Dale

Journal Article, Open Access

Behavior Research Methods (2016) 48(3) 909-921

DOI: 10.3758/s13428-016-0766-5

Abstract

We present a new R package, cmscu, which implements a Count-Min-Sketch with conservative updating (Cormode and Muthukrishnan, Journal of Algorithms, 55(1), 58–75, 2005), and its application to n-gram analyses (Goyal et al. 2012). By writing the core implementation in C++ and exposing it to R via Rcpp, we are able to provide a memory-efficient, high-throughput, and easy-to-use library. As a proof of concept, we implemented the computationally challenging (Heafield et al. 2013) modified Kneser–Ney n-gram smoothing algorithm using cmscu as the querying engine. We then explore information density measures (Jaeger, Cognitive Psychology, 61(1), 23–62, 2010) from n-gram frequencies (for n = 2, 3) derived from a corpus of over 2.2 million reviews in a dataset provided by Yelp, Inc. We demonstrate that these text data are at a scale beyond the reach of other more common, more general-purpose libraries available through CRAN. Using the cmscu library and the smoothing implementation, we find a positive relationship between review information density and reader review ratings. We end by highlighting the importance of new, efficient tools for exploring behavioral phenomena in large, relatively noisy data sets.
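
To make the central data structure concrete, the following is a minimal Python sketch of a Count-Min-Sketch with conservative updating applied to bigram counting. It is an illustration under stated assumptions, not the cmscu implementation or its API: the class name `CountMinSketchCU`, the helper `ngrams`, the table dimensions, and the choice of salted BLAKE2b hashing are all hypothetical choices made for this example.

```python
import hashlib


class CountMinSketchCU:
    """Illustrative Count-Min-Sketch with conservative updating.

    Hypothetical example class; it does not mirror the cmscu package API.
    """

    def __init__(self, width=2**16, depth=4):
        self.width = width    # number of counters per row
        self.depth = depth    # number of rows, one hash function each
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # Derive one column index per row from a row-salted BLAKE2b digest.
        data = key.encode("utf-8")
        cols = []
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            cols.append(int.from_bytes(digest, "little") % self.width)
        return cols

    def query(self, key):
        # The estimate is the minimum counter across rows; hash collisions
        # can inflate it, but it never falls below the true count.
        return min(self.table[r][c] for r, c in enumerate(self._cells(key)))

    def add(self, key, count=1):
        # Conservative update: raise only the counters that lie below
        # (current estimate + count), leaving larger counters untouched.
        cols = self._cells(key)
        target = min(self.table[r][c] for r, c in enumerate(cols)) + count
        for r, c in enumerate(cols):
            if self.table[r][c] < target:
                self.table[r][c] = target


def ngrams(tokens, n):
    """Return the space-joined n-grams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Count bigrams of a toy sentence in fixed memory.
sketch = CountMinSketchCU(width=1024, depth=4)
tokens = "the food was good and the food was cheap".split()
for gram in ngrams(tokens, 2):
    sketch.add(gram)
```

Memory use is fixed by `width` and `depth` regardless of how many distinct n-grams are seen, which is what makes the approach attractive at corpus scale. The cost is that estimates may overshoot when n-grams collide in every row, and conservative updating reduces that overshoot relative to incrementing every counter.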

Citation (APA)

Vinson, D. W., Davis, J. K., Sindi, S. S., & Dale, R. (2016). Efficient n-gram analysis in R with cmscu. Behavior Research Methods, 48(3), 909–921. https://doi.org/10.3758/s13428-016-0766-5
