This paper describes a method for automatically discovering features that distinguish the language use of cultural subgroups operating within the same broader language and culture. Sociolinguists have long known that special features, such as vocabulary use, phonetic features (like accents), and syntactic characteristics, develop within the in-group language of frequently interacting subgroups. These features set the group's language apart from the discourse of others speaking the same broader language. Our interest is to learn these features automatically and use them to distinguish the writing of one subgroup from that of another. The special vocabulary and jargon of various subgroups have often been catalogued. This research focuses instead on syntactic differences that can be learned from digital text and on specialized uses of vocabulary that are not topic or domain specific (i.e., we deliberately omit domain-related jargon). Our main data source is blogs and related discussions from a number of North American subculture groups, such as radical feminists and militia groups. In this paper we present our findings on identifying blogs whose participants have a particular subcultural affiliation, designated as "blogs of interest." Our hypothesis is that we can ignore the particular topic of a blog discussion, through means described in the paper, and isolate other linguistic indicators that help us determine whether or not a blog is "of interest." We start with an overview of the process of training our system and describe its use in identifying blogs of the desired cultural subgroup. We then describe in detail the training process, in which a series of grams is scored and aggregated to find key, highly indicative blog passages. The last section reports on an experiment we conducted that demonstrated the concept on several North American English-language blogging communities. © 2012 Published by Elsevier B.V.
Paradis, R. D., Davenport, D., Menaker, D., & Taylor, S. M. (2012). Detection of groups in non-structured data. In Procedia Computer Science (Vol. 12, pp. 412–417). Elsevier B.V. https://doi.org/10.1016/j.procs.2012.09.095