Quality & Quantity 34: 259–274, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.

A Conceptual Framework for Quantitative Text Analysis
On Joining Probabilities and Substantive Inferences about Texts

CARL W. ROBERTS
Department of Sociology or Department of Statistics, Iowa State University, Ames, IA 50011, U.S.A.

Abstract. Quantitative text analysis refers to the application of one or more methods for drawing statistical inferences from text populations. After briefly distinguishing quantitative text analysis from linguistics, computational linguistics, and qualitative text analysis, issues raised during the 1955 Allerton House Conference are used as a vehicle for characterizing classical text analysis as an instrumental-thematic method. Quantitative text analysis methods are then depicted according to a 2 × 3 conceptual framework in which texts are interpreted either instrumentally (according to the researcher's conceptual framework) or representationally (according to the texts' sources' perspectives), as well as in which variables are thematic (counts of word/phrase occurrences), semantic (themes within a semantic grammar), or network-related (theme- or relation-positions within a conceptual network). Common methodological errors associated with each method are discussed. The paper concludes with a delineation of the universe of substantive answers that quantitative text analysis is able to provide to social science researchers.

Key words: content analysis, text analysis, semantic grammar, network, instrumental versus representational, quantitative methods.

1. Introduction

Painted with broad strokes, formal analyses of linguistic data are pursued from three academic orientations: linguistics, computer science, and the social sciences.
Linguists' interests are primarily in describing text structure: as surface forms produced by an innate human capacity to generate linguistic expressions (Chomsky, 1965), as sequences of functional forms expressed by goal-oriented humans according to discrete narrative grammars (Greimas, 1984 [1966]; Halliday, 1994), as patterns of symbols uttered by fallible native speakers in ways (in)consistent with some prescribed standard (Honey, 1983), etc. With recent developments in computer technology, linguists have begun to evaluate their theories by developing corresponding text-parsing software (cf. Rosner and Johnson, 1992). This work forms the academic branch of computational linguistics, in addition to which a more applied, commercial branch has developed with the objective of quickly "understanding" user input and yielding as output the user-expected outcome (Grishman, 1986; McEnery, 1992). Finally, social scientists conduct formal analyses of written and spoken texts to reveal mechanisms according to which words influence and are influenced by human behavior.

This paper is a review of quantitative text analysis methods, one of two general classes of methods currently in use for the social scientific analysis of texts. To be quantitative, a text analysis must both address a social scientific question of a well-defined text population and provide an answer to the question having a known probability of inaccurately reflecting aspects of the text population.¹ Although there is presently an astounding number of recent books on qualitative text analysis (e.g., Fielding and Lee, 1991; Riessman, 1993; Silverman, 1993; Denzin and Lincoln, 1994; Feldman, 1994; Krueger, 1994; Marshall and Rossman, 1995; Miles and Huberman, 1994; Wolcott, 1994; Kelle, 1995; Weitzman and Miles, 1995), virtually all discussions of quantitative text analysis methods are written as if there had been no innovations in these methods since the 1960s (cf. Altheide, 1996; Lee, 1999; but, as an exception, Roberts, 1997a). In the following section, I use the mid-1955 Allerton House Conference debates on contingency analysis to introduce the classical approach to text analysis that then and into the 1980s was the predominant text analysis methodology in the United States. The approach's instrumental orientation is then differentiated from an increasingly utilized representational text analysis orientation. To complete my 2 × 3 classification of quantitative text analysis methods, I then distinguish classical thematic text analysis from more recent semantic and network text analysis methods.

This paper's purpose is to impose some long-needed structure on a wide spectrum of text analysis methodologies that heretofore have been accessible only in a smattering of methodology journals. Its emphasis is on application.
That is, it is intended to aid researchers in answering the question, "Which quantitative text analysis method best affords answers to what research question?"

2. Classical Text Analysis

Quantitative text analysis has a long tradition in the works of Lasswell, Berelson, George, Osgood, Pool, Stone, Holsti, Krippendorff, Weber, and many others. During a work conference held in 1955 at the University of Illinois-Monticello's Allerton House, many of these text analysis pioneers gathered to develop solutions to the methodological problems of the day. Trends in Content Analysis (Pool, 1959) is the scholarly legacy of this conference.

The most influential, if not the largest, faction among the conference's participants was a group of Harvard researchers who made extensive use of what they called "contingency analysis". The first step in a contingency analysis involves counting occurrences of content categories within sampled blocks of text. This produces a data matrix like that in Table I, with distinct content categories (or themes) heading the columns, unique text blocks heading the rows, and counts of occurrences (of theme within block) in the cells. The analysis proceeds by computing a matrix of associations between pairs of themes. Finally, the researcher develops (usually post hoc) explanations of why some themes co-occurred and why others were disassociated (i.e., negatively associated).

Table I. A data matrix for a thematic text analysis

ID-number   Theme 1   Theme 2   Theme 3
1           2         0         0
2           0         0         1
3           1         3         1
4           0         2         1
5           0         0         0
.           .         .         .
.           .         .         .
.           .         .         .

During the Allerton House conference Alexander George (1959: 17f), whose work on World War II propaganda did not use contingency analysis, pointed out that such "fishing expeditions" (sic) are not sensitive enough to detect the instrumental use to which communication is put by the speaker. Co-occurrences do not reveal how themes are used in the same blocks of text, let alone why the speaker used them that way. In response, exponents of contingency analysis acknowledged that their technique is not able to detect changes in communication strategies (i.e., it cannot legitimately be used to investigate instrumental communication), but that it can be used to trace patterns in representational communication (i.e., communication that "means what it says on its face"). Thus classification of representational communication into content categories is done based on the assumption that "what an author says is what he means" (Pool, 1959: 4).

Amidst this curious exercise in turf delineation, Charles Osgood (1959) described and illustrated his evaluative assertion analysis (a precursor of contemporary network text analysis) in building a defense for analyzing the representational content of communication. Here, in a casual but extremely insightful remark, Osgood (1959: 75) notes, "As a matter of fact, we may define a method of content analysis as allowing for 'instrumental' analysis if it taps message evidence that is beyond the voluntary control of the source and hence yields valid inferences despite the strategies of the source".
And later, regarding contingency analysis, "The final stage, in which the analyst interprets the contingency structure is entirely subjective, of course" (p. 76). With these remarks, Osgood attached an entirely different meaning to the term "instrumental". No longer did it allude to the strategy behind a source's communication, but to the researcher's interpretive strategy in analyzing the communication. In this usage, words are not the source's strategic instruments, but are symptomatic instruments from which the researcher can diagnose possibly unconscious or unacknowledged characteristics of the source.
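
The two mechanical steps of a contingency analysis that precede this subjective interpretation (counting themes within text blocks, then computing a matrix of associations between pairs of themes) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the data are the five rows printed in Table I (the table's elided rows are not reproduced), the Pearson correlation stands in for the association measure because the text does not name one, and the function names and the category dictionary passed to count_themes are hypothetical.

```python
from math import sqrt

# Rows of Table I as printed (text blocks 1-5); columns are Themes 1-3.
# The table's elided rows are not reproduced here.
TABLE_1 = [
    [2, 0, 0],
    [0, 0, 1],
    [1, 3, 1],
    [0, 2, 1],
    [0, 0, 0],
]


def count_themes(block, categories):
    """Step 1: count occurrences of each content category in one text block.

    `categories` maps a theme name to the set of words that signal it.
    It is a hypothetical dictionary, not one taken from the paper.
    """
    words = [w.strip(".,;:!?\"'()").lower() for w in block.split()]
    return [sum(w in vocab for w in words) for vocab in categories.values()]


def pearson(xs, ys):
    """Pearson correlation of two equal-length count vectors (None if undefined).

    Assumption: Pearson's r is used only as one possible association measure.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a theme with constant counts has no defined correlation
    return cov / sqrt(var_x * var_y)


def association_matrix(data):
    """Step 2: associations between every pair of themes (matrix columns)."""
    columns = list(zip(*data))
    return [[pearson(a, b) for b in columns] for a in columns]


if __name__ == "__main__":
    for row in association_matrix(TABLE_1):
        print(["n/a" if r is None else f"{r:+.2f}" for r in row])
```

On these five rows, Themes 2 and 3 are positively associated (r is approximately 0.65), Themes 1 and 3 are negatively associated (approximately -0.41), and Themes 1 and 2 are essentially unassociated. Explaining why such patterns arise is the step that the computation cannot supply.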