Efficient n-gram analysis in R with cmscu

This article is free to access.

Abstract

We present a new R package, cmscu, which implements a Count-Min-Sketch with conservative updating (Cormode and Muthukrishnan, Journal of Algorithms, 55(1), 58–75, 2005), and its application to n-gram analyses (Goyal et al. 2012). By writing the core implementation in C++ and exposing it to R via Rcpp, we are able to provide a memory-efficient, high-throughput, and easy-to-use library. As a proof of concept, we implemented the computationally challenging (Heafield et al. 2013) modified Kneser–Ney n-gram smoothing algorithm using cmscu as the querying engine. We then explore information density measures (Jaeger, Cognitive Psychology, 61(1), 23–62, 2010) computed from n-gram frequencies (for n = 2, 3) derived from a corpus of over 2.2 million reviews in a dataset provided by Yelp, Inc. We demonstrate that text data at this scale are beyond the reach of the more common, general-purpose libraries available through CRAN. Using the cmscu library and the smoothing implementation, we find a positive relationship between review information density and reader review ratings. We end by highlighting the importance of new, efficient tools for exploring behavioral phenomena in large, relatively noisy data sets.
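For readers unfamiliar with the data structure named in the abstract, the following is a minimal Python sketch of a Count-Min-Sketch with conservative updating applied to bigram counts. It is illustrative only and is not the cmscu implementation or its API: the class name, the `width` and `depth` defaults, the choice of BLAKE2b as the hash, and the toy sentence are all assumptions made for this example.

```python
import hashlib
from typing import List


class ConservativeCountMinSketch:
    """Approximate counter: `depth` hash rows of `width` cells each (hypothetical class)."""

    def __init__(self, width: int = 2048, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _cells(self, item: str) -> List[int]:
        # One column index per row, taken from a row-salted BLAKE2b digest.
        cols = []
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            cols.append(int.from_bytes(digest, "little") % self.width)
        return cols

    def query(self, item: str) -> int:
        # The minimum over rows is the estimate; it can overcount, never undercount.
        return min(self.table[r][c] for r, c in enumerate(self._cells(item)))

    def add(self, item: str, count: int = 1) -> None:
        # Conservative update: raise a cell only if it is below (estimate + count),
        # instead of incrementing every row unconditionally.
        cols = self._cells(item)
        target = min(self.table[r][c] for r, c in enumerate(cols)) + count
        for r, c in enumerate(cols):
            if self.table[r][c] < target:
                self.table[r][c] = target


def bigrams(tokens: List[str]) -> List[str]:
    # Join adjacent tokens into space-separated bigram keys.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


sketch = ConservativeCountMinSketch()
tokens = "the food was great and the service was great".split()
for gram in bigrams(tokens):
    sketch.add(gram)

print(sketch.query("was great"))  # estimated count of the repeated bigram
```

Memory use is fixed at `width * depth` counters regardless of how many distinct n-grams are seen, which is the property that makes this kind of structure attractive for large corpora. The trade-off is that a query returns an upper-biased estimate rather than an exact count.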

Cite

Vinson, D. W., Davis, J. K., Sindi, S. S., & Dale, R. (2016). Efficient n-gram analysis in R with cmscu. Behavior Research Methods, 48(3), 909–921. https://doi.org/10.3758/s13428-016-0766-5
