Integrating Experiential and Distributional Data to Learn Semantic Representations

Mark Andrews, Gabriella Vigliocco, and David Vinson
University College London

The authors identify 2 major types of statistical data from which semantic representations can be learned. These are denoted as experiential data and distributional data. Experiential data are derived by way of experience with the physical world and comprise the sensory-motor data obtained through sense receptors. Distributional data, by contrast, describe the statistical distribution of words across spoken and written language. The authors claim that experiential and distributional data represent distinct data types and that each is a nontrivial source of semantic information. Their theoretical proposal is that human semantic representations are derived from an optimal statistical combination of these 2 data types. Using a Bayesian probabilistic model, they demonstrate how word meanings can be learned by treating experiential and distributional data as a single joint distribution and learning the statistical structure that underlies it. The semantic representations that are learned in this manner are measurably more realistic (as verified by comparison to a set of human-based measures of semantic representation) than those available from either data type individually or from both sources independently. This is not a result of merely using quantitatively more data, but rather it is because experiential and distributional data are qualitatively distinct, yet intercorrelated, types of data. The semantic representations that are learned are based on statistical structures that exist both within and between the experiential and distributional data types.
Keywords: semantic representations, probabilistic models, Bayesian models, computational models, distributional data

In this article, we consider the topic of how human semantic representations may be learned by integrating two distinct types of statistical data. Throughout the article, we use the term semantic representation to refer to the mental representation of the meaning of words. This we define informally as the knowledge underlying the ability to make inferences about, for example, which words are synonymous, which are similar, what are the various senses of a given word, what (if anything) are its referents, and so on. The theoretical objective of the article is to consider how this knowledge is acquired. We address this question specifically by asking what the different types of statistical data from which human beings can gain this knowledge are and how these data can be integrated to form semantic representations.

For the purpose of this article, we concentrate on two broad and general types of statistical data. We refer to the first as experiential data and the second as language-based distributional data, or just simply distributional data. Experiential data specify the perceived physical attributes or properties associated with the referents of words. For example, the word apple refers to objects in the world whose perceived attributes or properties include being red or green, round, shiny, smooth, crunchy, juicy, sweet, tasty, and so on. As we use the term, experiential data are the entirety of the data obtained directly through sense receptors and from which people gain their knowledge of the world. We also use the term in a more general sense to include affective properties, such as whether something is pleasant, unpleasant, fearsome, and so on, and to include affordances (Gibson, 1977, 1979; Norman, 1988), or the properties that tell people how to interface and interact with an object.
Author note: Mark Andrews, Gabriella Vigliocco, and David Vinson, Cognitive, Perceptual and Brain Sciences, Division of Psychology and Language Sciences, University College London, London, United Kingdom.

This research was supported by European Union (FP6-2004-NEST-PATH) Grant 028714 and U.K. Biotechnology and Biological Sciences Research Council (BBSRC) Grant 31/S18 to Gabriella Vigliocco and by U.K. Economic and Social Research Council (ESRC) Grant RES-620-28-6001 to the Deafness, Cognition and Language Research Centre, University College London. All the codes necessary to simulate the models described in this article are available at http://www.mjandrews.net/code.

Correspondence concerning this article should be addressed to Mark Andrews, Cognitive, Perceptual and Brain Sciences, Division of Psychology and Language Sciences, University College London, Gower Street, London WC1 6BT, United Kingdom. E-mail: m.andrews@ucl.ac.uk

Psychological Review, 2009, Vol. 116, No. 3, 463-498. © 2009 American Psychological Association. 0033-295X/09/$12.00 DOI: 10.1037/a0016261

The second type of data that we consider is distributional data. Distributional data specify how a given word is statistically distributed across different spoken or written texts. We use the term text here in a very general sense to refer to any coherent and self-contained piece of written or spoken language. This could include, for example, a newspaper article, a spoken conversation, a letter or e-mail message, an essay, a speech, and so on. If we divide a corpus of spoken and written language into a set of separate texts, the distribution of a given word across this corpus is simply whether and how often it appears within each text.

As we describe them in this article, experiential and distributional data represent data types that are, respectively, extralinguistic versus intralinguistic in their origin. In other words, experiential data are data derived from human perception and interaction with the physical world, while distributional data are derived from the statistical characteristics within a language itself. It is reasonable to assume and, as we explain below, it has been empirically verified that semantic
representations can be learned from either one of these sources. For example, part of the meaning of the word apple can be learned from the fact that it refers to that set of objects with the properties or attributes like those mentioned above (i.e., being red or green, round, shiny, smooth, crunchy, juicy, sweet, tasty, etc.). Words like peach, pear, or apricot refer to objects with broadly similar characteristics and can thus be inferred to be similar in meaning. On the other hand, the word apple also occurs within phrases like the leaves of the apple tree, a glass of apple juice, roast pork with apple sauce, and so on. Words like cherry, cranberry, or pear can also be found within broadly similar linguistic contexts, and on this basis, irrespective of their possible physical referents, these words can be inferred as being semantically similar.

As we describe below, the contribution of either experiential or distributional data to the learning of human semantic representations has been studied extensively in recent literature within cognitive science. For the most part, throughout this literature, however, the contribution of either one of these data types has been considered independently and to the exclusion of the other. In almost all cases, attention has been focused exclusively upon either experiential data alone or upon distributional data alone. The primary objective of this article is to consider the combined effects of both sources of data. Our theoretical proposal is that experiential and distributional data both represent major sources of data from which humans can learn semantic representations. In particular, we propose that the statistical patterns describing the structure of experiential data and those describing the structure of distributional data can be jointly used and combined to learn semantic representations.
As such, our primary hypothesis, one that we make precise in later sections, is that semantic representations are the product of what we call the statistical combination of experiential and distributional data types. By this, we mean that semantic representations are derived from the optimal statistical combination of experiential and distributional data, rather than either relying primarily upon one source alone or by simply averaging over the separate effects of both sources.

Two Traditions in the Study of Semantic Representations

In the past literature on human semantic representations, the contributions of experiential and distributional data have been approached largely independently and to the exclusion of one another. This has led, in effect, to two broad traditions in the contemporary study of semantic representation. The first tradition, what we call the experiential tradition, describes semantic representations in terms of the attributes or properties associated with words. As such, it argues largely in favor of the primacy of extralinguistic data, emphasizing how semantic representations are learned from human experience and interaction with the world. By contrast, the second or distributional tradition describes semantic representations in terms of the statistical patterns that occur within a language itself. This tradition argues for the primacy of intralinguistic data and emphasizes how semantic representations are learned by way of human experience and use of language. Each one of these traditions can be seen to have distinct historical origins and philosophical antecedents, and these backgrounds have shaped the nature and focus of the two traditions in contemporary cognitive science research.

Experiential Tradition

The experiential tradition is broadly based on a philosophical perspective that characterizes the meaning of words in terms of objects and events in the world.
Specifically, this perspective characterizes the meaning of a word as corresponding to the mental representation of, for example, the object in the world to which that word refers, where these representations are ultimately based on the set of perceived physical properties of the object. This general philosophical perspective can be seen to have its origin in early modern empiricism, particularly that of Locke (1632-1704). Locke's perspective on human cognition is described in An Essay Concerning Human Understanding (Locke, 1689/1975). For Locke, all knowledge is ultimately based upon sensible qualities, or sensory data derived through various sensory modalities. Locke's theory of word meaning is that the meaning of a word is the mental representation of the object to which it refers, or, as Locke put it, "words in their primary or immediate signification stand for nothing but the ideas in the mind of him that uses them" (Locke, 1689/1975, Book III, Chapter II, Part 2). The semantic representation of a word like apple is simply the concept or representation of an apple. This representation is, according to Locke, a hierarchical composition of the elementary perceptual attributes of an apple such as its size, shape, color, and so on. For Locke, word meanings are thus represented as patterns over the properties of objects in the world.

The empiricist theoretical perspective exemplified by the work of Locke has been widely adopted throughout the study of human semantic representations within cognitive psychology and cognitive neuroscience. This perspective is clearly evident in the seminal works in this area. For example, in Quillian (1967, 1969) and Collins and Quillian (1969), knowledge is represented in hierarchical terms: General categories are described in terms of subcategories or their constituent objects, while the constituent objects are described in terms of their perceived attributes and properties. Likewise, E.
Smith, Shoben, and Rips (1974) explicitly defined the semantic representation of a word in terms of the set of properties associated with its referent. Following Rips, Shoben, and Smith (1973), E. Smith et al. proposed that semantic memory can be described as a multidimensional space, the dimensions of which are this set of properties.

This work has led naturally to a more statistical interpretation of the problem of semantic representation. According to this view, words correspond to sets of perceived properties or, equivalently, to points in a high-dimensional space. Learning semantic representations corresponds to learning the intrinsic statistical structure of this space. We can find this general statistical perspective in the early models of distributed memory presented by McClelland and Rumelhart (1985). In this work, words are explicitly described as distributions over elementary attributes, and neural network learning models are used to learn the structure of these distributions. In cognitive neuroscience, these same principles and techniques underlie the models of deep dyslexia by Hinton and Shallice (1991) and Plaut and Shallice (1993) and the models of category-specific deficits by Farah and McClelland (1991); Devlin, Gonnerman, Andersen, and Seidenberg (1998); and Tyler, Moss, Durrant-Peatfield, and Levy (2000).

In the study of semantic representation in normal or cognitively unimpaired subjects, we see these principles in the work of McRae, de Sa, and Seidenberg (1997). In this work, a speaker-generated data set of 190 common nouns, each described in terms of 1,242 elementary features, was used. Using an attractor network to learn the distributions over word-form and semantic-feature representations of these 190 words, it was confirmed that the distances between the representations of these words in the attractor network predicted human judgments on the similarities between words as evidenced by a priming task. In related work, Vigliocco, Vinson, Lewis, and Garrett (2004) collected feature norms for a set of 456 common words, comprising 240 nouns and 216 verbs. Participants described these words in terms of a set of 1,029 elementary features. A self-organizing map was used to learn the low-dimensional structure of this data. As in the work of McRae et al., it was found that words described in terms of this semantic space were predictive of human similarity judgments, as evidenced by semantic priming, picture-word interference (PWI), and error induction. Further work by McClelland and Rogers (2003) and Rogers and McClelland (2005) has shown that knowledge derived from the distribution of attributes associated with words is sufficient to explain the formation of hierarchical categorical knowledge as in Quillian (1967, 1969), the progressive differentiation throughout development of semantic categories into finer subsets (Keil, 1979; Mandler, 2000; Mandler, Bauer, & McDonough, 1991), the early emergence of the so-called basic level of categories (Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976), and inferences about the relative importance of certain properties within different categories (S. A. Gelman & Markman, 1986; Macario, 1991), a phenomenon often otherwise explained by way of innate knowledge of conceptual categories (Carey, 1985; Keil, 1979, 1989).

Recently, there has been compelling evidence that the representation of words in the brain is in terms of distributed patterns over the sensory-motor properties of their referents.
For example, the premotor and motor cortices have been found to be consistently activated by language referring to body actions (Aziz-Zadeh, Wilson, Rizzolatti, & Iacoboni, 2006; Pulvermüller, 1999, 2001; Pulvermüller, Hauk, Nikulin, & Ilmoniemi, 2005; Tettamanti et al., 2005; Vigliocco et al., 2006), tool actions, tools, or manipulable objects (Chao & Martin, 2000; Gerlach, Law, & Paulson, 2002; Grabowski, Damasio, & Damasio, 1998). Transcranial magnetic stimulation studies have provided converging evidence that lexical and sentential items with motor associations activate motor areas of the cortex (Buccino et al., 2005; Oliveri et al., 2004) and localized motor cortical areas corresponding to the specific effector of an action (Buccino et al., 2005; Pulvermüller et al., 2005). Alongside the motor cortex, mediotemporal activity is repeatedly seen for body and tool actions as well as tool objects (Damasio et al., 2001; Martin, Haxby, Lalonde, Wiggs, & Ungerleider, 1995; Martin, Wiggs, Ungerleider, & Haxby, 1996; Phillips, Noppeney, Humphreys, & Price, 2002; Tettamanti et al., 2005). There is also some evidence that it is active during comprehension of words referring to fruit or an object's form (Phillips et al., 2002; Pulvermüller & Hauk, 2005). With regard to sensory properties, the fusiform gyrus is documented as playing a role in the representation of object form (Chao, Haxby, & Martin, 1999; Vuilleumier, Henson, Driver, & Dolan, 2002), and different areas of the fusiform have been implicated for different categories, namely, lateral fusiform for animals and medial fusiform for tools (Martin & Chao, 2001).
In a series of experiments, for example, Martin and colleagues (Beauchamp, Lee, Haxby, & Martin, 2002; Chao et al., 1999; Chao & Martin, 2000; Chao, Weisberg, & Martin, 2002; Ishai, Ungerleider, Martin, Schouten, & Haxby, 1999) showed that naming objects referring to different semantic categories activated a broad and largely overlapping region of the ventral and lateral temporal cortex but that the profile of activation differed depending on category. This suggests that object concepts are represented according to object features, rather than according to semantic categories corresponding to specific and anatomically segregated modules. These results support the role of the fusiform in representing the visual attributes of known objects, and more generally, this area of the cortex as involved in higher order visual association, combining features from different modalities (Vigliocco et al., 2006).

Distributional Tradition

If the philosophical background of the experiential tradition can be traced back to at least as early as 17th century empiricism, the corresponding philosophical background of the distributional tradition is of a more recent vintage. Wittgenstein (1953/1997) famously proposed that "for a large class of cases . . . in which we employ the term meaning, it can be defined thus: The meaning of a word is its use in the language" (Section 43). Wittgenstein was arguing against what he perceived to be a pervasive conception of meaning and language. According to this view, language is a mirror of the world: Words refer to objects in the world, while propositions consist of words arranged into a structure that will mirror the interrelationships of their referents. Against this traditional view, Wittgenstein presented his alternative conception of language premised upon the idea that the meaning of a word is based on how it is used within a language.
Rather than pointing to something exterior to the language, a word's meaning is determined by the role it plays within the language itself.

Although the precise implications of Wittgenstein's ideas were (and, indeed, still remain) elusive, Firth (1957) was inspired by Wittgenstein's characterization of language to propose that humans can learn the meaning of a word by examining the various contexts and circumstances of its common usage. Firth suggested that "you shall know a word by the company it keeps" and that human beings learn at least part of the meaning of a word from "its habitual collocation" with other words (Firth, 1957, p. 11). For Firth, words that are found in identical or similar environments can be taken to share at least some of their meanings. In a similar manner, and writing contemporaneously to Firth, Harris (1954) proposed the distributional hypothesis whereby word meanings are derived in part from their distribution across different linguistic environments. Harris suggested, for example, that "if (two words) A and B have almost identical environments . . . we say they are synonyms," while "if A and B have some environments in common and some not . . . we say that they have different meanings," with "the amount of meaning difference corresponding roughly to the amount of difference in their environments" (Harris, 1954, p. 157).

Although the idea that word meanings may be learned from the distribution of words across a corpus was first mooted as early as the 1950s, this hypothesis was not seriously considered, and it is likely that its plausibility remained dubious, until computing resources grew to sufficient power. In this respect, one of the first attempts to adequately address this hypothesis was due to Schütze
(1992). In this work, following the common practice in information retrieval (Salton & McGill, 1983), the distribution of a word across a corpus is represented by a vector describing that word's frequency of co-occurrence with every other word. As such, each word can be viewed as a point in high-dimensional space, and matrix factorization is used to find the intrinsic dimensionality of this space. Every point in the original high-dimensional space can be represented in the lower dimensional space, and the distance between these points is taken as a measure of the dissimilarity between the corresponding words.

The work of Schütze (1992) was to strongly influence the work of Lund and Burgess and their hyperspace analog of language (HAL) model (e.g., Burgess & Lund, 1997; Lund & Burgess, 1996; Lund, Burgess, & Atchley, 1995). The HAL model was motivated as an attempt to automatically derive a model of human semantic memory from the statistics in a text. Using a large Usenet corpus, the HAL model describes each word as a high-dimensional co-occurrence vector precisely as in Schütze. The dimensionality of this space is reduced by simply removing all but the 200 highest variance columns. In Lund and Burgess (1996), it was shown that in this lower dimensional space, semantic categories like, for example, animals, body parts, and geographical regions form clusters that are easily distinguishable from one another. Lund and Burgess also showed that the distances between word pairs in the HAL space correlate positively, with a coefficient of up to r = .35, with semantic-priming reaction times, that is, where these word pairs are the prime-target pairs in a lexical decision task.
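
The pipeline just described (co-occurrence vectors, removal of all but the highest variance columns, and distances between the reduced vectors) can be illustrated with a minimal Python sketch. This is not the HAL implementation: the toy corpus, the symmetric unweighted window of two words, the choice of k = 4 columns, the Euclidean distance, and all function names are assumptions made purely for illustration.

```python
from math import sqrt


def cooccurrence_vectors(tokens, window=2):
    """For each word, count how often every vocabulary word occurs
    within `window` positions of it (symmetric, unweighted window)."""
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    counts = {word: [0] * len(vocab) for word in vocab}
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for other in range(lo, hi):
            if other != pos:
                counts[word][index[tokens[other]]] += 1
    return vocab, counts


def top_variance_columns(counts, k):
    """Return the indices of the k columns with the highest variance."""
    words = list(counts)
    n_cols = len(counts[words[0]])

    def variance(col):
        values = [counts[w][col] for w in words]
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    ranked = sorted(range(n_cols), key=variance, reverse=True)
    return sorted(ranked[:k])


def project(counts, columns):
    """Keep only the selected columns of every word vector."""
    return {w: [vec[c] for c in columns] for w, vec in counts.items()}


def distance(u, v):
    """Euclidean distance, used here as the dissimilarity measure."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy corpus; a real model would use millions of tokens.
corpus = ("the cat drinks milk . the dog drinks water . "
          "the cat chases the mouse . the dog chases the cat .").split()
vocab, counts = cooccurrence_vectors(corpus, window=2)
columns = top_variance_columns(counts, k=4)
reduced = project(counts, columns)
print(distance(reduced["cat"], reduced["dog"]))
print(distance(reduced["cat"], reduced["milk"]))
```

With a corpus this small the distances are not meaningful in themselves; the sketch only shows how each step operates on the co-occurrence matrix.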
The latent semantic analysis (LSA) work of Landauer and colleagues (e.g., Landauer & Dumais, 1997; Landauer, Laham, & Foltz, 1998; Landauer, Laham, Rehder, & Schreiner, 1997) was influenced by the information-retrieval work of Deerwester, Dumais, Furnas, Landauer, and Harshman (1990) but is comparable to the HAL model in its modeling objectives. In LSA, words are described as high-dimensional vectors indicating the extent to which they occur in a large set of documents in a corpus. Matrix factorization is used to find the intrinsic structure of this space, and the cosine of the angle between vectors in this reduced space can be taken as a measure of interword similarity. Using this measure, Landauer and Dumais (1997) reported comparable performance between LSA and humans. Using 80 items from a Test of English as a Foreign Language synonym test, the authors reported a performance of 64.4% correct for LSA, compared to 64.5% by nonnative English speakers applying to U.S. colleges. They also showed that in cases where LSA was incorrect, its choices were positively correlated (r = .44) with the choices made by the college applicants. In Landauer et al. (1998), it was further reported that LSA performed comparably to children (r = .5) and adults (r = .32) on a word-sorting task.

More recently, the work of Griffiths and Steyvers (2002, 2003) and Griffiths, Steyvers, and Tenenbaum (2007) provided a probabilistic model of human semantic representation that is based upon the latent Dirichlet allocation (LDA) model of Blei, Ng, and Jordan (2003). The LDA model can be seen as a probabilistic generalization of LSA whereby each text in a corpus is a probabilistic weighting of a set of discourse topics, with each discourse topic corresponding to a probability distribution over words that emphasizes a certain theme. For example, the discourse topic labeled sport may place most of its probability mass on words like game, ball, play, team, competition, and so on.
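
The generative assumption behind such topic models can be made concrete with a short Python sketch: a text is produced by first drawing topic weights and then, for each word position, drawing a topic and a word from that topic. The sketch covers only this forward process, not the inference procedure of Griffiths and colleagues or of Blei et al. The word probabilities, the finance topic, the Dirichlet parameters, and the function names are all hypothetical values chosen for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw topic weights from a Dirichlet distribution by
    normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_text(topics, alpha, length, rng):
    """Generate one text: draw its topic weights, then draw each word
    by first picking a topic and then a word from that topic."""
    names = list(topics)
    weights = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=weights)[0]
        vocab = list(topics[topic])
        probs = list(topics[topic].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return weights, words


# Each topic is a probability distribution over words (illustrative numbers).
topics = {
    "sport": {"game": 0.3, "ball": 0.25, "play": 0.2,
              "team": 0.15, "competition": 0.1},
    "finance": {"bank": 0.3, "money": 0.3, "loan": 0.2, "market": 0.2},
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
weights, words = generate_text(topics, alpha=[0.5, 0.5], length=12, rng=rng)
print(weights)
print(words)
```

Learning runs this process in reverse: given only the observed words in many texts, the topics and the per-text weights are inferred.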
The aim of learning in these models is to infer the component topics. Each word in the model can then be represented as a distribution over these latent topics, and this can be taken to be its semantic representation. Using these representations, the relationships between words, namely, interword similarities, can also be inferred. For the purposes of comparison between the similarity relationships inferred by the model and those of human judgments, the Nelson word-association norms (Nelson, McEvoy, & Schreiber, 2004) were used as a standard. Griffiths and Steyvers (2003) and Griffiths et al. reported that across a large set of words, the most highly related word according to their model was exactly the most highly associated word according to the norms. They also reported superior performance of their model, according to this measure, in comparison to the original LSA model of Landauer and Dumais (1997).

The Necessity of Combining Data Types

From what we have reviewed so far, it is evident that on the basis of either experiential data alone or distributional data alone, intuitively correct semantic relationships may be derived and that, moreover, these have psychological validity as evidenced by comparisons with human-based measures of semantic similarity. However, in all of the studies that we have reviewed, the focus of attention has been upon one of these data types alone, independent and to the exclusion of the other. In what follows, we show that there are obvious problems with any perspective that advocates the importance of one data type to the total exclusion of the other.

An obvious criticism of experiential data is that they are largely limited to the so-called concrete terms, that is, those words that have tangible physical referents or instantiations. Concrete terms constitute only a small subset of the commonly used words in the language.
Most other terms, even if we exclude the function words whose role is primarily syntactic, do not have obvious physical instantiations. These terms include not just the canonical examples of abstract terms such as truth, art, justice, and so on but also more mundane terms such as government, finance, crime, and so on. While these latter terms can be, either directly or indirectly, related to objects or events in the world, the meaning of these terms is in no way exhausted by these referents. It is arguable that their meaning is fully appreciated only in the context of a much richer body of knowledge about, for example, how people, economies, and societies work and that this body of knowledge can only be acquired through language and the representational medium it affords. Knowledge acquired by way of the physical or experiential attributes of the referents of these terms, albeit substantial and important, is nevertheless inadequate to fully account for their semantic representations.

In contrast, a fundamental criticism of distributional data is that they are disconnected or disembodied from the physical world. In other words, distributional data describe the relationship of words only to one another but not to the physical world or anything else beyond language itself. This fact alone is taken by some as an a priori argument against the plausibility of distributional models as accounts of human semantic representation. For example, Glenberg and Robertson (2000) argued that because distributional approaches propose that "the meaning of an abstract symbol (a word) can arise from the conjunction of relations to other undefined abstract symbols" (p. 381), this is grounds for their rejection
as plausible models of semantic representation. For Glenberg and Robertson, to know "the meaning of an abstract symbol such as . . . an English word, the symbol has to be grounded in something other than more abstract symbols" (Glenberg & Robertson, 2000, p. 382).

Even if this perspective is not found to be convincing, the disembodied nature of distributional data does present real challenges for distributional models of semantic representation. For example, as a consequence of their disembodied character, distributional models cannot account for any of the previously described neuroscientific evidence showing that words are represented in the brain according to the sensory-motor characteristics of their referents. More generally, distributional models cannot explain how any knowledge acquired by way of distributional data can be related back to the world. Although two words may have similar distributional patterns and from that it may be inferred that they are semantically related, what they refer to in the world still cannot be known, nor can any world knowledge, otherwise acquired, be integrated with knowledge derived from distributional patterns. For example, on the basis of distributional patterns, it may be inferred that the terms dog and cat are related, but nonetheless, it is not clear that these words also refer to familiar domestic animals and pets, nor could any knowledge of these domestic creatures, acquired through interaction with them in the world, be integrated with this distributional-based knowledge. For knowledge acquired from language to be pragmatically useful, it must ultimately relate back to the world. It is challenging to explain how this can be the case if words are known only through their relationship with other words.

It is not, however, necessary to choose between experiential and distributional data as if they were mutually exclusive. Both types of data are available to humans when learning the meaning of words.
Words are encountered simultaneously within two rich contexts: the physical world itself and the discourse of human language. As such, it is reasonable to assume that both data types are used concurrently to learn word meanings. For example, one can imagine the following scenario: On the basis of perceptual experience, a child learns that the term dog refers to creatures that make barking noises, have four legs and waggy tails, and so on. In addition, through general experience with language, the child learns that the term dog co-occurs with terms like pet, animal, and so on. In this learning scenario, it is as if there is a dual, or parallel, corpus of data. On the one hand, there is the stream of words that is the language itself, and on the other, there is the set of perceived properties associated with (at least a subset of) the words. Knowledge that the word dog refers to those creatures that have four legs and tails, that bark, and so on can be integrated with the knowledge that dog co-occurs with pet, animal, and so on. The two sources of information could then be combined to provide a richer understanding of the semantics of the word dog than could be learned by either source alone.

In this article, we describe how experiential and distributional data can be combined to learn semantic representations. The manner in which we model this follows the same general rationale as that of the statistical models reviewed so far. While, in those models, the objective was to infer the statistical structure underlying either experiential data alone or distributional data alone, in the models we use here, we aim to infer the statistical structure underlying the joint distribution of both data types.

To our knowledge, the joint role of experiential and distributional data in the learning of semantic representations has yet to be thoroughly investigated. There are, however, some notable studies that relate, or are precursors, to this work.
The well-known dual-coding theory of Paivio (e.g., Paivio, 1971, 1986/1990) was one of the first theories to propose a distinction between information acquired by way of sensory processes and information acquired through language. Recently, studies such as, for example, Yu and Smith (2007) and L. B. Smith and Yu (2008) have shown how statistical regularities between the co-occurrences of words and the co-occurrences of their referents can facilitate the learning of word-object mappings, and others such as, for example, Roy and Pentland (2002) have shown how simultaneous statistics from visual and auditory senses can facilitate the discovery of words and their referents. Similarly, Louwerse (2008) investigated the parallels between information encoded in linguistic structures and that of more sensory-motor, or embodied, information. More relevant still is the work of Howell and Becker (2001); Howell, Becker, and Jankowicz (2001); and Howell, Jankowicz, and Becker (2005), who have shown that the acquisition of a language by a simple recurrent network is improved when words are augmented with sensory-motor representations. From this, it is argued that grounding words by way of sensory-motor representations can provide a form of semantic bootstrapping for the learning of a grammar. Despite the importance of these studies, however, none have specifically addressed the distinct nature of the semantic information provided by experiential versus language-based distributional data or how both information types could be integrated to form semantic representations. This, we believe, is the primary contribution of the work we describe in this article.

The Consequences of Combining Data Types

Any theory of semantic representation that is based on combining experiential and distributional data will clearly redress some of the fundamental limitations inherent in theories based on using only one source alone.
However, the importance of combining data is not simply that two major sources of data are being used. Rather, because these two sources are correlated with one another, learning from both sources jointly permits more knowledge to be acquired from the available data and allows knowledge acquired from language to be related to knowledge about the world. This has important consequences for any theory of how knowledge (semantic knowledge in particular, but any knowledge more generally) is acquired and used.

By learning semantic representations from the joint distribution of experiential and distributional data, more semantic knowledge is gained from the available data than is possible using one source exclusively or using both independently. This is a consequence of the elementary statistical fact that all the information in a joint probability distribution cannot be known by reference to its marginal distributions. We can see this in Figure 1, where the joint distribution P(x, y) varies across the subfigures, but both marginal distributions, P(x) and P(y), remain unchanged. By a direct analogy, all the information from which semantic knowledge can be attained is given by the joint distribution over both experiential and distributional data. Obviously, considering only one of these sources will lead to information loss. Equally important, though perhaps less intuitively obvious, is that treating the two data sets