On the give and take between event apprehension and utterance formulation ☆

Lila R. Gleitman, David January, Rebecca Nappa, John C. Trueswell *

Department of Psychology, Institute for Research in Cognitive Science, University of Pennsylvania, 3401 Walnut Street, Room 302C, Philadelphia, PA 19104-6228, USA

Received 2 July 2006; revision received 22 January 2007
Available online 16 July 2007

Abstract

Two experiments are reported which examine how manipulations of visual attention affect speakers' linguistic choices regarding word order, verb use and syntactic structure when describing simple pictured scenes. Experiment 1 presented participants with scenes designed to elicit the use of a perspective predicate (The man chases the dog/The dog flees from the man) or a conjoined noun phrase sentential Subject (A cat and a dog/A dog and a cat). Gaze was directed to a particular scene character by way of an attention-capture manipulation. Attention capture increased the likelihood that this character would be the sentential Subject and altered the choice of perspective verb or word order within conjoined NP Subjects accordingly. These effects occurred even though participants reported being unaware that their visual attention had been manipulated. Experiment 2 extended these results to word order choice within Active versus Passive structures (The girl is kicking the boy/The boy is being kicked by the girl) and symmetrical predicates (The girl is meeting the boy/The boy is meeting the girl). Experiment 2 also found that early endogenous shifts in attention influence word order choices. These findings indicate a reliable relationship between initial looking patterns and speaking patterns, reflecting considerable parallelism between the on-line apprehension of events and the on-line construction of descriptive utterances.
© 2007 Elsevier Inc. All rights reserved.
Keywords: Sentence production; Word order; Visual attention; Attention capture; Eye movements

Journal of Memory and Language 57 (2007) 544–569
www.elsevier.com/locate/jml
0749-596X/$ - see front matter © 2007 Elsevier Inc. All rights reserved.
doi:10.1016/j.jml.2007.01.007

☆ Authorship order was determined alphabetically by last name. We thank Katherine McEldoon for her helpful comments and assistance on drafts of this paper. This work was partially funded by a grant to L.R.G. and J.C.T. from the National Institutes of Health (1-R01-HD37507).
* Corresponding author. Fax: +1 215 898 7301. E-mail address: firstname.lastname@example.org (J.C. Trueswell).

People seem to think before they speak: Having understood and conceptualized some event or state of affairs, they construct and utter some phrase or sentence to describe it. On this picture, the relationship between apprehension and linguistic formulation is sequential, incremental, and causal. But is the progression from thought to speech always as tidy as this? Words sometimes seem to start tumbling forth before we fully apprehend a scene or organize our thoughts about it. In light of these contrasting intuitions, it is perhaps not surprising that debate concerning the timing and information characteristics of apprehension and linguistic formulation has a venerable psycholinguistic history. (See Bock, Irwin, & Davidson, 2004, for a recent review of the literature, which dates back most notably to Lashley, 1951; Paul, 1886/1970; and Wundt, 1900/1970; and also includes the recent experimental literature on sentence
production.) Here we examine cases in which the apprehension of the visual world and the production of an utterance describing it suggest a surprisingly tight temporal coupling between perceptual and linguistic processes.

Factors controlling sentence production

Obviously there are several veridical ways to describe any single scene. For example, consider Fig. 1. Any of the following utterances (some of which are more natural than others) adequately describe this scene.

1 a. A dog is chasing a man.
  b. A man is running away from a dog.
  c. A dog is pursuing a man.
  d. A man is fleeing a dog.
  e. A dog is being fled from by a man.
  f. A man is being chased by a dog.

How does the speaker choose among these options?

Beginnings

Several aspects of this problem can be characterized as "starting point questions" because the first-mentioned word or phrase constrains both the form and content of the remainder of the utterance (Bock et al., 2004). For instance, speakers typically begin their description of Fig. 1 with one of two noun phrases (henceforth, NP): A man... (as in 1b, d, f) or A dog... (as in 1a, c, e). This choice is often characterized as hinging, at least in part, on some notion of accessibility which itself branches into several subtypes.

One level of accessibility is perceptual and concerns just where the speaker's eyes land first, on the dog or the man. Plausibly, this property of initially inspecting the scene could have a corresponding influence on what is mentioned first. Effects of such visual landing sites have been studied indirectly in experiments in which attentional focus is drawn to a particular character. Notably, Tomlin (1997) repeatedly showed participants short cartoons of one fish eating another. Throughout, an arrow pointed to a particular fish and participants were to keep their eyes on that fish during the presentation.
Under these conditions participants tended to mention the indicated fish first, choosing it as the Subject even when this meant using the ordinarily disfavored Passive structure (e.g., The red fish is being eaten by the blue fish). Thus, at least in some highly constrained situations, there appears to be an influence on sentence formulation of prior or simultaneous visual attention to some particular individual in the scene.

However, word order is also responsive to higher-level accessibility factors, and these may weaken or even obliterate any effect of first visual landing-site. For instance, some constructional types are preferred to others, e.g., all other things held equal, Active voice sentences (1a, b, c, d) are strongly favored over passives (1e or f) unless specific presuppositional supports are provided (e.g., Bock, 1986; Bock & Loebell, 1990; Slobin & Bever, 1982). Related accessibility distinctions hold on the conceptual side: For example, creatures higher in an animacy hierarchy tend to be in Subject position, making 1b, d, and f preferred over 1a, c, and e (Dowty, 1991; see also Bock, 1986). These conceptual and linguistic preference factors themselves interact. Because frequent words are favored over infrequent ones and chase is a more frequent lexical item than either pursue or flee, this might promote the use of chase (1a or f) over the other descriptions (Griffin & Bock, 2000). Such a tendency may be enhanced by a semantic bias, across predicates, to favor descriptions in which the source is the logical Subject (1a, c, or f) and the goal is the Object over goal-to-source descriptions (1b, d, or e; Fisher, Hall, Rakowitz, & Gleitman, 1994; Lakusta & Landau, 2005).

Apprehension of gist

In summarizing factors guiding utterance formulation, we have so far implicitly envisaged an incremental process in which an utterance-initiating word or phrase, the "starting point," is chosen, and further effects on the sentence are constrained by this first choice. But this picture is at best oversimplified and may even be a false characterization. That is, the so-called starting points may themselves be effects of a prior global apprehension of the scene in view, i.e., its conceptual-semantic gist.

Fig. 1. A sample Perspective Predicate scene, depicting the verb pair chase/flee.

As Bock et al. (2004) have recently put this: "What cements a starting point is not the relative salience of elements in the perceptual or conceptual underpinnings
of a message, but the identification of something that affords a continuation or completion; that is, a predication." (Bock et al., 2004, p. 270)

Indeed, the experimental evidence for visual-attentive factors guiding nominal (or other "elemental") starting points is quite weak. Consider again Tomlin (1997). In this experiment, an arrow superimposed on the picture told the participants which fish to look at, and they were instructed to maintain this fixation throughout the generation of their utterance. This rather blatant manipulation of attention leaves open the possibility that participants were aware of the intention of the study, thus producing the expected findings in contravention of their behavioral tendencies under more neutral conditions. Moreover, repeated description of the same event (all trials were fish-eating-fish events) essentially precludes generalization (see Bock et al., 2004, for discussion of this point). And the repetition of the fish-characters across trials might itself create confounds. For example, inspection of the Tomlin (1997) videos (available on the web at http://logos.uoregon.edu/tomlin/research.html) reveals that the cued fish on any given trial (e.g., the red fish) was always present on the immediately preceding trial, but the uncued fish (e.g., the blue fish) was never present on the previous trial. Thus, the cueing of a particular fish was perfectly confounded with which fish had been mentioned most recently by the participant. Given that recent mention of an entity promotes Subject status on its own, it is entirely plausible that this discourse factor, rather than attentional cueing, was determining the speakers' choice.

In fact, subsequent studies (Bock, Irwin, Davidson, & Levelt, 2003; Griffin & Bock, 2000) suggest that eye position may not be a cause of word order choice, but rather an artifact generated as a consequence of the more global semantic analysis of the scene. In the words of Bock et al.
(2003) "...when speakers produce fluent utterances to describe events, the eye is sent not to the most salient element in a scene, but to an element already established as a suitable starting point." (Bock et al., 2003, p. 680).

Bock et al. (2003) based this conclusion on experiments (Bock et al., 2003; Griffin & Bock, 2000) that employed a task similar to Tomlin's (1997) except that no visual cues or attention instructions were used. Instead, participants' eye movements were recorded as they carried out this task. When coupled with the content and the timing of the utterances, such eye movements can provide a strikingly fine-grained measure of the relationship between visual apprehension and linguistic formulation.

In Griffin and Bock (2000), participants viewed and described line drawings depicting simple agent-patient events such as the one in Fig. 2. In English, there is room for choice as to which character to mention first while preserving the general meaning of the sentence because the scene in Fig. 2 can be described with an Active sentence, The girl is spraying the boy, or a Passive sentence, The boy is getting/being sprayed by the girl.¹ Of course, English speakers are disinclined to utter Passive-voice sentences; so to increase the likelihood of Passive production, three (out of a total of eight) stimulus pictures (which were then mirrored and role-traded to create four stimulus lists) involved one human and one non-human character. Because human characters tend to appear as sentential Subjects, this increased the number of Passive descriptions when the human participant was the Patient of the action.

Fig. 2. A figure similar to those used by Griffin and Bock (2000), depicting a simple transitive action (A girl spraying a boy).

¹ Griffin and Bock (2000) manipulated experimentally which character played which role. Therefore, as a between-subject variable, there was a "role reversal" variant for each of the eight original pictures. (Each picture and its role-reversed variant are herein called a "stimulus type.") For the present example (Fig. 2), the role-reversed picture would have shown a boy spraying a girl/a girl being sprayed by a boy. It should also be mentioned that there was a single example of a perspective-verb pair type, namely a scene of chasing/fleeing (see Fig. 1 for an equivalent used in our own experiments). For such verbs there is a non-Passive Patient-first alternative, namely flee or run away from, thus potentially unconfounding constructional and first-mention factors. The flee-type responses were collapsed together with the Passives for analysis in these experiments, though their form is probably always Active voice ("was run away from by" and "was fled from by" being awkward and therefore unlikely locutions).

Griffin and Bock (2000) reasoned that if output from the early production stages involving apprehension of an event were expeditiously passed along to the later stages geared towards formulation of a linguistic characterization, then initial fixations to characters (and their sequential ordering) should be predictive of their
description order. On the other hand, if the initial conceptualization depended solely upon the processes involved in apprehending the relations between characters in a scene, and linguistic considerations became a factor only later in the process, initial fixation on one character or the other would not predict which is mentioned first (and hence which is placed in grammatical Subject position, whether in an Active or Passive frame).

Results of Griffin and Bock's analyses supported the latter prediction: Speakers almost always uttered Active sentences, and the first 300 ms of the eye movement record showed no significant difference between looking times to the character that would ultimately be mentioned first versus the one that would be mentioned second; initial fixation on one character or the other did not predict Subjecthood in the upcoming utterance. A large difference in looking patterns did emerge beyond 300 ms after visual inspection began: Subject-referents were fixated more than Object-referents just prior to speech onset, and the opposite was true just after speech onset.

Griffin and Bock interpreted this pattern as consistent with a rapid initial apprehension period, during which the gist of the event is extracted. In their words, "The evidence that apprehension preceded formulation, seen in both event comprehension times and the dependency of grammatical role assignments on the conceptual features of major event elements, argues that a wholistic process of conceptualization set the stage for the creation of a to-be-spoken sentence." (Griffin & Bock, 2000, p. 279).

Additional support for this conclusion was found in results from a separate group of participants who viewed the same pictures but were instead asked only to select the character being acted upon (the Patient). Here, eye-movements diverged between Patient and Agent approximately 300 ms into viewing.
Given that Patient selection requires event apprehension, the data suggest that it is possible to achieve this gist-extraction process in the first 300 ms of viewing these stimuli.

These and subsequent supportive studies (Bock et al., 2003) suggested to the authors not only a separation of apprehension and formulation processes but a clear temporal dissociation as well. As Griffin and Bock (2000) put this, "The results point to a language production process that begins with apprehension or the generation of a message and proceeds through incremental formulation of sentences" (Griffin & Bock, 2000, p. 279).

Open issues

Despite these useful findings, the current literature leaves a number of issues concerning utterance planning unresolved. Specifically, neither the more serial nor the more interactive accounts that have been proposed delve too deeply into questions involving the conceptualization stage itself. Many otherwise sequential models (e.g., Levelt, 1989; Levelt, Roelofs, & Meyer, 1999) allow for feedback between the conceptual stage of sentence planning and lemma representation, for example. Research exploring the question of the conceptual factors underlying word order choices has implicated variables such as concreteness, predicability, and particularly animacy as driving forces in Subject role assignment (see MacDonald, Bock, & Kelly, 1993, for a discussion) but has not investigated the time course with which any such conceptual factors contribute to the process of selecting thematic and syntactic roles when producing an utterance. For example, as one is apprehending a man participating in some event (an event not yet specified at the message level), will the production system generate a lemma candidate MAN to participate in the yet-to-be-determined proposition? Or is further apprehension of the relational information relevant to the man (e.g., Is he wearing a red hat? Or near a bicycle?)
necessary before such linguistic planning can begin? Griffin and Bock (2000) endorse the latter account and support it with the aforementioned finding: Early fixations (in the first 300 ms of viewing a scene) in their studies simply did not predict the order in which fixated characters were mentioned in a descriptive sentence.

Griffin and Bock's results are, however, surprising not only in light of Tomlin (1997) but also in light of findings in the perception literature suggesting that initial gaze direction can exert a powerful influence on the outcome of the apprehension process itself (Ellis & Stark, 1978; Gale & Findlay, 1983; Pomplun, Ritter, & Velichkovsky, 1996). For instance, manipulation of a perceiver's first fixation influences his/her interpretation of ambiguous figures (Georgiades & Harris, 1997). In this study, participants viewed ambiguous images such as the classic mother-in-law/wife image, each of which had been preceded by a fixation crosshair that was designed to direct initial attention to certain aspects of the image. Attending first to visual features that are critical to the mother-in-law interpretation increased reports of a mother-in-law, and mutatis mutandis. These features of the scene are independent of any general salience factors having to do with mothers-in-law or wives, or, apparently, with visual properties of mothers-in-law and wives as portrayed in this image. Rather, the finding suggests, much as do Tomlin's findings, that what you first look at becomes, in virtue of that, the focus of your attention.

A related study concerns how attention influences the assignment of perceptual Figure and Ground. Vecera, Flevaris, and Filapek (2004) presented participants with simple images such as the one depicted in
Fig. 3. This image is ambiguous in that it can be interpreted either as a gray figure on a black background or as a black figure on a gray background. In the experiment, participants' attention was captured to one part of the image via a brief (50 ms) flash that accompanied stimulus onset. Such a cue is known to draw a participant's eye movements in a way that is rarely noticed by the participant (McCormick, 1997). Interestingly, Vecera et al. found that the cued region was more likely to be subsequently interpreted as the Figure.

In sum, the perception literature suggests that endogenous and exogenous contributions to initial attention can generate changes in interpretation of an image and even the assignment of Figure–Ground. In contrast, there was no trace of such an effect in Griffin and Bock (2000), seeming to suggest that the speaker's visual attention (as indexed by initial fixation and early looking-time preference) and his/her subsequent speech behavior (as indexed by first-mentioned character) are divided by a conceptual firewall that reorganizes the observed event for the sake of speech under quite different influences. This may simply be the fact of the matter, but the mismatch between these literatures provides at least some impetus for further investigation.

Indeed, it is important to reiterate that published eye movement analyses of depicted events are thus far limited to Griffin and Bock (2000), who studied just 8 pictorial items (and their role-traded variants). And these were so constructed that, with a single exception (the chase/flee example), they required participants to utter Passive-type sentences as the only environment in which to show effects of initial attention. But we know that English speakers, on independent grounds, tend to disfavor the Passive in speech (e.g., Slobin & Bever, 1982; Goldman-Eisler & Cohen, 1970).
This imbalance in constructional preference rather than (or in addition to) any tendency to sequence utterance formulation may have accounted for the experimental results. The bias to utter a canonical Active-voice sentence may have overwhelmed any observable effects of initial attention.

Stimulus types used in the present study

Following the methodology of Griffin and Bock (2000), in the present study we asked participants to describe novel depicted scenes, but these were designed to elicit various kinds of linguistically different but semantically equivalent utterances. (By "semantically equivalent," we mean two utterances that have roughly equivalent meanings, but may have different discourse or focusing properties.) Such types allow us to see what is driving linguistic choice when the conceptualization of the event is held constant (or close to constant). In addition to the Active/Passive alternation we examined three further productive word-order alternations. Each is exemplified in Fig. 4.

We chose these linguistic alternations because, although they are all semantically equivalent, each type differs in the extent to which the alternatives share the same linguistic-structural forms, the same discourse implications (e.g., Given vs. New), and the same information structure (e.g., Figure versus Ground).

(1) Active/Passive Pairs are often put forth as the classic structural alternation in English that preserves propositional meaning: If the cat drinks the milk, it follows that the milk is drunk by the cat. Not only are Active/Passive pairs usually semantically equivalent descriptions of events,² they are both descriptions of the very same event.
It strains credulity to suppose, for example, that Jane could observe the cat drinking the milk while George simultaneously observes that (very) milk being drunk by that (very) cat, and yet the two of them are observing "different events." However, these alternative forms differ considerably in other regards that may be relevant in linguistic processing tasks, e.g., the Active form is more frequent than the Passive, less complex, acquired earlier, and more accessible.

Fig. 3. An image used by Vecera et al. (2004) to investigate contributions of attention to figure–ground assignment in visual perception. © Blackwell Publishing 2004

² We say that Passivization only "usually" yields a semantically equivalent sentence because, among other exceptions, it notoriously interacts with quantification; thus Every boy kissed at least one woman does not entail that At least one woman was kissed by every boy. Stimuli in this experiment do not implicate such problems.

(2) Perspective Predicates describe the same scene from the standpoint of one or the other character in the event. For Fig. 4B (repeated here from Fig. 1