BEHAVIORAL AND BRAIN SCIENCES (2005) 28, 105–167
Printed in the United States of America
© 2005 Cambridge University Press 0140-525X/05 $12.50

From monkey-like action recognition to human language: An evolutionary framework for neurolinguistics

Michael A. Arbib
Computer Science Department, Neuroscience Program, and USC Brain Project, University of Southern California, Los Angeles, CA 90089-2520
firstname.lastname@example.org
http://www-hbp.usc.edu/

Abstract: The article analyzes the neural and functional grounding of language skills as well as their emergence in hominid evolution, hypothesizing stages leading from abilities known to exist in monkeys and apes and presumed to exist in our hominid ancestors right through to modern spoken and signed languages. The starting point is the observation that both premotor area F5 in monkeys and Broca's area in humans contain a "mirror system" active for both execution and observation of manual actions, and that F5 and Broca's area are homologous brain regions. This grounded the mirror system hypothesis of Rizzolatti and Arbib (1998) which offers the mirror system for grasping as a key neural "missing link" between the abilities of our nonhuman ancestors of 20 million years ago and modern human language, with manual gestures rather than a system for vocal communication providing the initial seed for this evolutionary process. The present article, however, goes "beyond the mirror" to offer hypotheses on evolutionary changes within and outside the mirror systems which may have occurred to equip Homo sapiens with a language-ready brain. Crucial to the early stages of this progression is the mirror system for grasping and its extension to permit imitation. Imitation is seen as evolving via a so-called simple system such as that found in chimpanzees (which allows imitation of complex "object-oriented" sequences but only as the result of extensive practice) to a so-called complex system found in humans (which allows rapid imitation even of complex sequences, under appropriate conditions) which supports pantomime. This is hypothesized to have provided the substrate for the development of protosign, a combinatorially open repertoire of manual gestures, which then provides the scaffolding for the emergence of protospeech (which thus owes little to nonhuman vocalizations), with protosign and protospeech then developing in an expanding spiral. It is argued that these stages involve biological evolution of both brain and body. By contrast, it is argued that the progression from protosign and protospeech to languages with full-blown syntax and compositional semantics was a historical phenomenon in the development of Homo sapiens, involving few if any further biological changes.

Key words: gestures; hominids; language evolution; mirror system; neurolinguistics; primates; protolanguage; sign language; speech; vocalization

Michael Anthony Arbib was born in England, grew up in Australia, and received his Ph.D. in Mathematics from MIT. After five years at Stanford, he became chairman of Computer and Information Science at the University of Massachusetts, Amherst, in 1970. He moved to the University of Southern California in 1986, where he is Professor of Computer Science, Neuroscience, Biomedical Engineering, Electrical Engineering, and Psychology. The author or editor of 38 books, Arbib recently co-edited Who Needs Emotions? The Brain Meets the Robot (Oxford University Press) with Jean-Marc Fellous. His current research focuses on brain mechanisms of visuomotor behavior, on neuroinformatics, and on the evolution of language.

1. Action-oriented neurolinguistics and the mirror system hypothesis

1.1. Evolving the language-ready brain

Two definitions:
1. A protolanguage is a system of utterances used by a particular hominid species (possibly including Homo sapiens) which we would recognize as a precursor to human language (if only the data were available!), but which is not itself a human language in the modern sense.1
2. An infant (of any species) has a language-ready brain if it can acquire a full human language when raised in an environment in which the language is used in interaction with the child.

Does the language readiness of human brains require that the richness of syntax and semantics be encoded in the genome, or is language one of those feats (from writing history to building cities to using computers) that played no role in biological evolution but rested on historical developments that created societies that could develop and transmit these skills? My hypothesis is that:

Language readiness evolved as a multimodal manual/facial/vocal system with protosign (manual-based protolanguage) providing the scaffolding for protospeech (vocal-based protolanguage) to provide "neural critical mass" to allow language to emerge from protolanguage as a result of cultural innovations within the history of Homo sapiens.2

The theory summarized here makes it understandable why it is as easy for a deaf child to learn a signed language as it is for a hearing child to learn a spoken language.
1.2. The mirror system hypothesis

Humans, chimps and monkeys share a general physical form and a degree of manual dexterity, but their brains, bodies, and behaviors differ. Moreover, humans can and normally do acquire language, and monkeys and chimps cannot, though chimps and bonobos can be trained to acquire a form of communication that approximates the complexity of the utterances of a 2-year-old human infant. The approach offered here to the evolution of brain mechanisms that support language is anchored in two observations: (1) The system of the monkey brain for visuomotor control of hand movements for grasping has its premotor outpost in an area called F5 which contains a set of neurons, called mirror neurons, each of which is active not only when the monkey executes a specific grasp but also when the monkey observes a human or other monkey execute a more or less similar grasp (Rizzolatti et al. 1996a). Thus F5 in monkey contains a mirror system for grasping which employs a common neural code for executed and observed manual actions (sect. 3.2 provides more details). (2) The region of the human brain homologous to F5 is part of Broca's area, traditionally thought of as a speech area but which has been shown by brain imaging studies to be active when humans both execute and observe grasps.

These findings led to the mirror system hypothesis (Arbib & Rizzolatti 1997; Rizzolatti & Arbib 1998, henceforth R&A):

The parity requirement for language in humans (that what counts for the speaker must count approximately the same for the hearer3) is met because Broca's area evolved atop the mirror system for grasping, with its capacity to generate and recognize a set of actions.
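The "common neural code" of observation (1) can be pictured with a deliberately crude sketch. It is not a model from the article: the class names, the grasp labels, and the exact-match rule are all assumptions made for illustration (real mirror neurons respond to "more or less similar" grasps, not identical labels). The point it shows is only that the unit's response depends on the grasp and not on who performs it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraspEvent:
    """A grasp together with who performs it (hypothetical representation)."""
    grasp: str      # e.g. "precision pinch" or "power grasp"
    performer: str  # "self" for an executed grasp, "other" for an observed one


@dataclass(frozen=True)
class ToyMirrorNeuron:
    """Toy unit tuned to one grasp; exact matching is a simplification."""
    preferred_grasp: str

    def fires(self, event: GraspEvent) -> bool:
        # The performer is ignored on purpose: execution and observation
        # of the preferred grasp share one code.
        return event.grasp == self.preferred_grasp


neuron = ToyMirrorNeuron(preferred_grasp="precision pinch")
executed = GraspEvent(grasp="precision pinch", performer="self")
observed = GraspEvent(grasp="precision pinch", performer="other")
different = GraspEvent(grasp="power grasp", performer="other")

print(neuron.fires(executed), neuron.fires(observed), neuron.fires(different))
```

In this toy setting the unit responds to both the executed and the observed precision pinch and stays silent for the power grasp.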
One of the contributions of this paper will be to stress that the F5 mirror neurons in the monkey are linked to regions of parietal and temporal cortex, and then argue that the evolutionary changes that "lifted" the F5 homologue of the common ancestor of human and monkey to yield the human Broca's area also "lifted" the other regions to yield Wernicke's area and other areas that support language in the human brain.

Many critics have dismissed the mirror system hypothesis, stating correctly that monkeys do not have language and so the mere possession of a mirror system for grasping cannot suffice for language. But the key phrase here is "evolved atop": Rizzolatti and Arbib (1998) discuss explicitly how changes in the primate brain might have adapted the use of the hands to support pantomime (intended communication) as well as praxis, and then outline how further evolutionary changes could support language. The hypothesis provides a neurological basis for the oft-repeated claim that hominids had a (proto)language based primarily on manual gestures before they had a (proto)language based primarily on vocal gestures (e.g., Armstrong et al. 1995; Hewes 1973; Kimura 1993; Stokoe 2001).4 It could be tempting to hypothesize that certain species-specific vocalizations of monkeys (such as the snake and leopard calls of vervet monkeys) provided the basis for the evolution of human speech, since both are in the vocal domain. However, these primate vocalizations appear to be related to non-cortical regions as well as the anterior cingulate cortex (see, e.g., Jürgens 1997) rather than F5, the homologue of Broca's area.
I think it likely (though empirical data are sadly lacking) that the primate cortex contains a mirror system for such species-specific vocalizations, and that a related mirror system persists in humans, but I suggest that it is a complement to, rather than an integral part of, the speech system that includes Broca's area in humans.

The mirror system hypothesis claims that a specific mirror system, the primate mirror system for grasping, evolved into a key component of the mechanisms that render the human brain language-ready. It is this specificity that will allow us to explain below why language is multimodal, its evolution being based on the execution and observation of hand movements. There is no claim that mirroring or imitation is limited to primates. It is likely that an analogue of mirror systems exists in other mammals, especially those with a rich and flexible social organization. Moreover, the evolution of the imitation system for learning songs by male songbirds is divergent from mammalian evolution, but for the neuroscientist there are intriguing challenges in plotting the similarities and differences in the neural mechanisms underlying human language and birdsong (Doupe & Kuhl 1999).5

The monkey mirror system for grasping is presumed to allow other monkeys to understand praxic actions and use this understanding as a basis for cooperation, averting a threat, and so on. One might say that this is implicitly communicative, as a side effect of conducting an action for noncommunicative goals.
Similarly, the monkey's orofacial gestures register emotional state, and primate vocalizations can also communicate something of the current priorities of the monkey, but to a first order this might be called "involuntary communication"6: these "devices" evolved to signal certain aspects of the monkey's current internal state or situation either through its observable actions or through a fixed species-specific repertoire of facial and vocal gestures. I will develop the hypothesis that the mirror system made possible (but in no sense guaranteed) the evolution of the displacement of hand movements from praxis to gestures that can be controlled "voluntarily."

It is important to be quite clear as to what the mirror system hypothesis does not say.
1. It does not say that having a mirror system is equivalent to having language. Monkeys have mirror systems but do not have language, and I expect that many species have mirror systems for varied socially relevant behaviors.
2. Having a mirror system for grasping is not in itself sufficient for the copying of actions. It is one thing to recognize an action using the mirror system; it is another thing to use that representation as a basis for repeating the action. Hence, further evolution of the brain was required for the mirror system for grasping to become an imitation system for grasping.
3. It does not say that language evolution can be studied in isolation from cognitive evolution more generally.

Arbib (2002) modified and developed the R&A argument to hypothesize seven stages in the evolution of language, with imitation grounding two of the stages.7 The first three stages are pre-hominid:
S1: Grasping.
S2: A mirror system for grasping shared with the common ancestor of human and monkey.
S3: A simple imitation system for object-directed grasping through much-repeated exposure. This is shared with the common ancestor of human and chimpanzee.
The next three stages then distinguish the hominid line from that of the great apes:
S4: A complex imitation system for grasping: the ability to recognize another's performance as a set of familiar actions and then repeat them, or to recognize that such a performance combines novel actions which can be approximated by variants of actions already in the repertoire.8
S5: Protosign, a manual-based communication system, breaking through the fixed repertoire of primate vocalizations to yield an open repertoire.
S6: Protospeech, resulting from the ability of control mechanisms evolved for protosign coming to control the vocal apparatus with increasing flexibility.9

The final stage is claimed (controversially!) to involve little if any biological evolution but instead to result from cultural evolution (historical change) in Homo sapiens:
S7: Language, the change from action-object frames to verb-argument structures to syntax and semantics; the co-evolution of cognitive and linguistic complexity.

The Mirror System Hypothesis is simply the assertion that the mechanisms that get us to the role of Broca's area in language depend in a crucial way on the mechanisms established in stage S2. The above seven stages provide just one set of hypotheses on how this dependence may have arisen. The task of this paper is to re-examine this progression, responding to critiques by amplifying the supporting argument in some cases and tweaking the account in others. I believe that the overall framework is robust, but there are many details to be worked out and a continuing stream of new and relevant data and modeling to be taken into account.

The claim for the crucial role of manual communication in language evolution remains controversial. MacNeilage (1998; MacNeilage & Davis, in press b), for example, has argued that language evolved directly as speech. (A companion paper [Arbib 2005] details why I reject MacNeilage's argument.
The basic point is to distinguish the evolution of the ability to use gestures that convey meaning from the evolution of syllabification as a way to structure vocal gestures.)

A note to commentators: The arguments for stages S1 through S6 can and should be evaluated quite independently of the claim that the transition to language was cultural rather than biological.

The neurolinguistic approach offered here is part of a performance approach which explicitly analyzes both perception and production (Fig. 1). For production, we have much we could possibly talk about which is represented as cognitive structures (cognitive form; schema assemblages) from which some aspects are selected for possible expression. Further selection and transformation yields semantic structures (hierarchical constituents expressing objects, actions and relationships) which constitute a semantic form that is enriched by linkage to schemas for perceiving and acting upon the world (Arbib 2003; Rolls & Arbib 2003). Finally, the ideas in the semantic form must be expressed in words whose markings and ordering are expressed in phonological form, which may include a wide range of ordered expressive gestures, whether manual, orofacial, or vocal. For perception, the received sentence must be interpreted semantically, with the result updating the "hearer's" cognitive structures. For example, perception of a visual scene may reveal "Who is doing what and to whom/which" as part of a nonlinguistic action-object frame in cognitive form. By contrast, the verb-argument structure is an overt linguistic representation in semantic form: in modern human languages, generally the action is named by a verb and the objects are named by nouns or noun phrases (see sect. 7).
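The contrast just drawn between a nonlinguistic action-object frame and a linguistic verb-argument structure can be made concrete with a small sketch. Everything in it is an assumption introduced for illustration: the two record types, the upper-case concept labels, and the tiny lexicon are hypothetical, and the mapping is far simpler than anything the brain or a real grammar would require.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ActionObjectFrame:
    """Cognitive form: who is doing what, and to whom/which (nonlinguistic)."""
    action: str
    agent: str
    patient: str


@dataclass(frozen=True)
class VerbArgumentStructure:
    """Semantic form: the action named by a verb, the objects by noun phrases."""
    verb: str
    arguments: Tuple[str, ...]


# Hypothetical lexicon linking concepts to words.
VERBS: Dict[str, str] = {"GRASP": "grasps", "EAT": "eats"}
NOUN_PHRASES: Dict[str, str] = {"MONKEY": "the monkey", "RAISIN": "the raisin"}


def to_verb_argument(frame: ActionObjectFrame) -> VerbArgumentStructure:
    """Name the action with a verb and each participant with a noun phrase."""
    return VerbArgumentStructure(
        verb=VERBS[frame.action],
        arguments=(NOUN_PHRASES[frame.agent], NOUN_PHRASES[frame.patient]),
    )


scene = ActionObjectFrame(action="GRASP", agent="MONKEY", patient="RAISIN")
structure = to_verb_argument(scene)
print(structure)
```

The frame carries no words at all; words enter only when the lexicon is applied, which is the step the text locates in semantic form.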
A production grammar for a language is then a specific mechanism (whether explicit or implicit) for converting verb-argument structures into strings of words (and hierarchical compounds of verb-argument structures into complex sentences), and vice versa for perception.

In the brain there may be no single grammar serving both production and perception, but rather, a "direct grammar" for production and an "inverse grammar" for perception. Jackendoff (2002) offers a competence theory with a much closer connection with theories of processing than has been common in generative linguistics and suggests (his sect. 9.3) strategies for a two-way dialogue between competence and performance theories. Jackendoff's approach to competence appears to be promising in this regard because it attends to the interaction of, for example, phonological, syntactic, and semantic representations. There is much, too, to be learned from a variety of approaches to cognitive grammar which relates cognitive form to syntactic structure (see, e.g., Heine 1997; Langacker 1987; 1991; Talmy 2000).

Figure 1. A performance view of the production and perception of language.

The next section provides a set of criteria for language readiness and further criteria for what must be added to yield language. It concludes (sect. 2.3) with an outline of the argument as it develops in the last six sections of the paper.

2. Language, protolanguage, and language readiness

I earlier defined a protolanguage as any system of utterances which served as a precursor to human language in the modern sense and hypothesized that the first Homo sapiens had protolanguage and a "language-ready brain" but did not have language.

Contra Bickerton (see Note 1), I will argue in section 7 that the prelanguage of Homo erectus and early Homo sapiens was composed mainly of "unitary utterances" that symbolized frequently occurring situations (in a general sense) without being decomposable into distinct words denoting components of the situation or their relationships. Words as we know them then co-evolved culturally with syntax through fractionation. In this view, many ways of expressing relationships that we now take for granted as part of language were the discovery of Homo sapiens; for example, adjectives and the fractionation of nouns from verbs may be "post-biological" in origin.

2.1. Criteria for language readiness

Here are properties hypothesized to support protolanguage:

LR1.
Complex imitation: The ability to recognize another's performance as a set of familiar movements and then repeat them, but also to recognize that such a performance combines novel actions that can be approximated by (i.e., more or less crudely be imitated by) variants of actions already in the repertoire.10

The idea is that this capacity (distinct from the simple imitation system for object-directed grasping through much repeated exposure which is shared with chimpanzees) is necessary to support properties LR2 and LR3, including the idea that symbols are potentially arbitrary rather than innate:

LR2. Symbolization: The ability to associate symbols with an open class of episodes, objects, or actions.
At first, these symbols may have been unitary utterances, rather than words in the modern sense, and they may have been based on manual and facial gestures rather than being vocalized.

LR3. Parity (mirror property): What counts for the speaker (or producer) must count for the listener (or receiver).
This extends Property LR2 by ensuring that symbols can be shared, and thus is bound up with LR4.

LR4. Intended communication: Communication is intended by the utterer to have a particular effect on the recipient rather than being involuntary or a side effect of praxis.

The remainder are more general properties, delimiting cognitive capabilities that underlie a number of the ideas which eventually find their expression in language:

LR5. From hierarchical structuring to temporal ordering: Perceiving that objects and actions have subparts; finding the appropriate timing of actions to achieve goals in relation to those hierarchically structured objects.
A basic property of language, translating a hierarchical conceptual structure into a temporally ordered structure of actions, is in fact not unique to language but is apparent whenever an animal takes in the nature of a visual scene and produces appropriate behavior.
Animals possess subtle mechanisms of action-oriented perception with no necessary link to the ability to communicate about these components and their relationships. To have such structures does not entail the ability to communicate by using words or articulatory gestures (whether signed or vocalized) in a way that reflects these structures.

Hauser et al. (2002) assert that the faculty of language in the narrow sense (FLN) includes only recursion and is the one uniquely human component of the faculty of language. However, the flow diagram given by Byrne (2003) shows that the processing used by a mountain gorilla when preparing bundles of nettle leaves to eat is clearly recursive. Gorillas (like many other species, and not only mammals) have the working memory to refer their next action not only to sensory data but also to the state of execution of some current plan. Hence, when we refer to the monkey's grasping and ability to recognize similar grasps in others, it is a mistake to treat the individual grasps in isolation: the F5 system is part of a larger system that can direct those grasps as part of a recursively structured plan.

Let me simply list the next two properties here, and then expand upon them in the next section:

LR6. Beyond the here-and-now 1: The ability to recall past events or imagine future ones.

LR7. Paedomorphy and sociality: Paedomorphy is the prolonged period of infant dependency which is especially pronounced in humans; this combines with social structures for caregiving to provide the conditions for complex social learning.

Where Deacon (1997) makes symbolization central to his account of the coevolution of language and the human brain, the present account will stress the parity property LR3, since it underlies the sharing of meaning, and the capacity for complex imitation.
I will also argue that only protolanguage co-evolved with the brain, and that the full development of linguistic complexity was a cultural/historical process that required little or no further change from the brains of early Homo sapiens.

Later sections will place LR1 through LR7 in an evolutionary context (see sect. 2.3 for a summary), showing how the coupling of complex imitation to complex communication creates a language-ready brain.

2.2. Criteria for language

I next present four criteria for what must be added to the brain's capabilities for the parity, hierarchical structuring, and temporal ordering of language readiness to yield language. Nothing in this list rests on the medium of exchange of the language, applying to spoken language, sign language, or written language, for example. My claim is that a brain that can support properties LR1 through LR7 above can support properties LA1 through LA4 below, as long
as its "owner" matures in a society that possesses language in the sense so defined and nurtures the child to acquire it. In other words, I claim that the mechanisms that make LR1 through LR7 possible are supported by the genetic encoding of brain and body and the consequent space of possible social interactions, but that the genome has no additional structures specific to LA1 through LA4. In particular, the genome does not have special features encoding syntax and its linkage to a compositional semantics.11

I suggest that "true language" involves the following further properties beyond LR1 through LR7:

LA1. Symbolization and compositionality: The symbols become words in the modern sense, interchangeable and composable in the expression of meaning.12

LA2. Syntax, semantics and recursion: The matching of syntactic to semantic structures coevolves with the fractionation of utterances, with the nesting of substructures making some form of recursion inevitable.

LA1 and LA2 are intertwined. Section 7 will offer candidates for the sorts of discoveries that may have led to progress from "unitary utterances" to more or less structured assemblages of words. Given the view (LR5) that recursion of action (but not of communication) is part of language readiness, the key transition here is the compositionality that allows cognitive structure to be reflected in symbolic structure (the transition from LR2 to LA1), as when perception (not uniquely human) grounds linguistic description (uniquely human) so that, for example, the noun phrase (NP) describing a part of an object may optionally form part of the NP describing the overall object. From this point of view, recursion in language is a corollary of the essentially recursive nature of action and perception once symbolization becomes compositional, and reflects addition of further detail to, for example, a description when needed to reduce ambiguity in communication.
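The NP example above (a noun phrase for a part optionally nested inside the noun phrase for the whole) is the kind of recursion that is easy to show in a few lines. The sketch below is a toy under stated assumptions: the `NounPhrase` type, the fixed determiner "the", and the linking word "with" are all hypothetical choices for illustration, not a claim about how any real grammar is organized.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NounPhrase:
    """Toy NP: a head noun, an optional modifier, and an optional nested NP
    describing a part of the object (hypothetical structure)."""
    head: str
    modifier: Optional[str] = None
    part: Optional["NounPhrase"] = None  # recursion: an NP inside an NP

    def render(self) -> str:
        words = ["the"]
        if self.modifier:
            words.append(self.modifier)
        words.append(self.head)
        if self.part is not None:
            # The description of the part is embedded in the description
            # of the whole, adding detail only when it is supplied.
            words += ["with", self.part.render()]
        return " ".join(words)

    def depth(self) -> int:
        """Number of nested NP levels, counting this one."""
        return 1 if self.part is None else 1 + self.part.depth()


plain = NounPhrase("cup")
detailed = NounPhrase("cup", part=NounPhrase("handle", modifier="broken"))

print(plain.render())     # the cup
print(detailed.render())  # the cup with the broken handle
```

The nested part is optional, so the same structure yields the short description when "the cup" is unambiguous and the longer one when extra detail is needed to pick out the right object.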
The last two principles provide the linguistic complements of two of the conditions for language readiness, LR6 (Beyond the here-and-now 1) and LR7 (Paedomorphy and sociality), respectively.

LA3. Beyond the here-and-now 2: Verb tenses or other circumlocutions express the ability to recall past events or imagine future ones.

There are so many linguistic devices for going beyond the here and now, and beyond the factual, that verb tenses are mentioned to stand in for all the devices languages have developed to communicate about other "possible worlds" that are far removed from the immediacy of, say, the vervet monkey's leopard call.

If one took a human language and removed all reference to time, one might still want to call it a language rather than a protolanguage, even though one would agree that it was thereby greatly impoverished. Similarly, the number system of a language can be seen as a useful, but not definitive, "plug-in." LA3 nonetheless suggests that the ability to talk about past and future is a central part of human languages as we understand them. However, all this would be meaningless (literally) without the underlying cognitive machinery: the substrate for episodic memory provided by the hippocampus (Burgess et al. 1999) and the substrate for planning provided by frontal cortex (Passingham 1993, Ch. 10). It is not part of the mirror system hypothesis to explain the evolution of the brain structures that support LR6; it is an exciting challenge for work "beyond the mirror" to show how such structures could provide the basis for humans to discover the capacities for communication summarized in LA3.

LA4. Learnability: To qualify as a human language, much of the syntax and semantics of a human language must be learnable by most human children.

I say "much of" because it is not true that children master all the vocabulary or syntactic subtlety of a language by 5 or 7 years of age.
Language acquisition is a process that continues well into the teens as we learn more subtle syntactic expressions and a greater vocabulary to which to apply them (C. Chomsky traces the changes that occur from ages 5 to 10), allowing us to achieve a richer and richer set of communicative and representational goals.

LR7 and LA4 link a biological condition "orthogonal" to the mirror system hypothesis with a "supplementary" property of human languages. This supplementary property is that languages do not simply exist; they are acquired anew (and may be slightly modified thereby) in each generation (LA4). The biological property is an inherently social one about the nature of the relationship between parent (or other caregiver) and child (LR7): the prolonged period of infant dependency which is especially pronounced in humans has co-evolved with the social structures for caregiving that provide the conditions for the complex social learning that makes possible the richness of human cultures in general and of human languages in particular (Tomasello 1999b).

2.3. The argument in perspective

The argument unfolds in the remaining six sections as follows:

Section 3. Perspectives on grasping and mirror neurons: This section presents two models of the macaque brain. A key point is that the functions of mirror neurons reflect the impact of experience rather than being pre-wired.

Section 4. Imitation: This section presents the distinction between simple and complex imitation systems for grasping, and argues that monkeys have neither, that chimpanzees have only simple imitation, and that the capacity for complex imitation involved hominid evolution since the separation from our common ancestors, the great apes, including chimpanzees.

Section 5.
From imitation to protosign: This section examines the relation between symbolism, intended communication, and parity, and looks at the multiple roles of the mirror system in supporting pantomime and then conventionalized gestures that support a far greater range of intended communication.

Section 6. The emergence of protospeech: This section argues that evolution did not proceed directly from monkey-like primate vocalizations to speech but rather proceeded from vocalization to manual gesture and back to vocalization again.

Section 7. The inventions of languages: This section argues that the transition from action-object frames to verb-argument structures embedded in larger sentences structured by syntax and endowed with a compositional semantics was the effect of the accumulation of a wide range of human discoveries that had little if any impact on the human genome.

Section 8. Toward a neurolinguistics "beyond the mirror": This section extracts a framework for action-oriented linguistics informed by our analysis of the "extended mirror
system hypothesis" presented in the previous sections. The language-ready brain contains the evolved mirror system as a key component but also includes many other components that lie outside, though they interact with, the mirror system.

Table 1 shows how these sections relate the evolutionary stages S1 through S7, and their substages, to the above criteria for language readiness and language (note 13).

Table 1. A comparative view of how the following sections relate the criteria LR1–LR for language readiness and LA1–LA2 for language (middle column) to the seven stages, S1–S7, of the extended mirror system hypothesis (right column)

Section | Criteria | Stages
2.1 | LR5: From hierarchical structuring to temporal ordering | This precedes the evolutionary stages charted here.
3.1 | | S1: Grasping. The FARS model.
3.2 | | S2: Mirror system for grasping. Modeling Development of the Mirror System. This supports the conclusion that mirror neurons can be recruited to recognize and encode an expanding set of novel actions.
4 | LR1: Complex imitation | S3: Simple imitation. This involves properties of the mirror system beyond the monkey's data. S4: Complex imitation. This is argued to distinguish humans from other primates.
5 | LR2: Symbolization; LR4: Intended communication; LR3: Parity (mirror property) | S5: Protosign. The transition of complex imitation from praxic to communicative use involves two substages: S5a: the ability to engage in pantomime; S5b: the ability to make conventional gestures to disambiguate pantomime.
6.1 | | S6: Protospeech. It is argued that early protosign provided the scaffolding for early protospeech, after which both developed in an expanding spiral until protospeech became dominant for most people.
7 | LA1: Symbolization and compositionality; LA2: Syntax, semantics, and recursion | S7: Language. The transition from action-object frame to verb-argument structure to syntax and semantics.
8 | | The evolutionary developments of the preceding sections are restructured into synchronic form to provide a framework for further research in neurolinguistics relating the capabilities of the human brain for language, action recognition, and imitation.

3. Perspectives on grasping and mirror neurons

Mirror neurons in F5, which are active both when the monkey performs certain actions and when the monkey observes them performed by others, are to be distinguished from canonical neurons in F5, which are active when the monkey performs certain actions but not when the monkey observes actions performed by others. More subtly, canonical neurons fire when the monkey is presented with a graspable object, irrespective of whether the monkey performs the grasp or not, but clearly this must depend on the extra (inferred) condition that the monkey not only sees the object but is aware, in some sense, that it is possible to grasp it. Were it not for the caveat, canonical neurons would also fire when the monkey observed the object being grasped by another.

The "classic" mirror system hypothesis (sect. 1.2) emphasizes the grasp-related neurons of the monkey premotor area F5 and the homology of this region with human Broca's area. However, Broca's area is part of a larger system supporting language, and so we need to enrich the mirror system hypothesis by seeing how the mirror system for grasping in monkey includes a variety of brain regions in addition to F5. I show this by presenting data and models that locate the canonical system of F5 in a systems perspective (the FARS model of sect. 3.1) and then place the mirror system of F5 in a systems perspective (the MNS model of sect. 3.2).

3.1. The FARS model

Given our concern with hand use and language, it is striking that the ability to use the size of an object to preshape the hand while grasping it can be dissociated by brain lesions from the ability to consciously recognize and describe that size. Goodale et al. (1991) studied a patient (D.F.) whose cortical damage allowed signals to flow from primary visual cortex (V1) towards posterior parietal cortex (PP) but not from V1 to inferotemporal cortex (IT). When D.F. was asked to indicate the width of a single block by means of her index finger and thumb, her finger separation bore no relationship to the dimensions of the object and showed considerable trial-to-trial variability. Yet when she was asked simply to reach out and pick up the block, the peak aperture (well before contact with the object) between her index finger and thumb changed systematically with the width of the object, as in normal controls. A similar dissociation was seen in her responses to the orientation of stimuli. In other words, D.F. could preshape accurately, even though she appeared to have no conscious appreciation (expressible either verbally or in pantomime) of the visual parameters that guided the preshape. Jeannerod et al. (1994) reported a study of impairment of grasping in a patient (A.T.) with a bilateral posterior parietal lesion of vascular origin that left IT and the pathway V1 → IT relatively intact, but grossly impaired the pathway V1 → PP. This patient can reach without deficit toward the location of an object, but cannot preshape appropriately when asked to grasp it.

A corresponding distinction in the role of these pathways in the monkey is crucial to the FARS model (named for Fagg, Arbib, Rizzolatti, and Sakata; see Fagg & Arbib 1998), which embeds F5 canonical neurons in a larger system. Taira et al. (1990) found that anterior intraparietal (AIP) cells (in the anterior intraparietal sulcus of the parietal cortex) extract neural codes for affordances for grasping from the visual stream and send these on to area F5. Affordances (Gibson 1979) are features of the object relevant to action, in this case to grasping, rather than aspects relevant to identifying the object. Turning to human data: Ehrsson et al.
(2003) compared the brain activity when humans attempted to lift an immovable test object held between the tips of the right index finger and thumb with the brain activity obtained in two control tasks in which neither the load force task nor the grip force task involved coordinated grip-load forces. They found that the grip-load force task was specifically associated with activation of a section of the right intraparietal cortex. Culham et al. (2003) found greater activity for grasping than for reaching in several regions, including the anterior intraparietal (AIP) cortex. Although the lateral occipital complex (LOC), a ventral stream area believed to play a critical role in object recognition, was activated by the objects presented on both grasping and reaching trials, there was no greater activity for grasping compared to reaching.

The FARS model analyzes how the "canonical system," centered on the AIP → F5 pathway, may account for basic phenomena of grasping. The highlights of the model are shown in Figure 2 (note 14), which diagrams the crucial role of IT (inferotemporal cortex) and PFC (prefrontal cortex) in modulating F5's selection of an affordance. The dorsal stream (from V1 to parietal cortex) carries the information needed for AIP to recognize that different parts of the object can be grasped in different ways, thus extracting affordances for the grasp system which are then passed on to F5. The dorsal stream does not know "what" the object is; it can only see the object as a set of possible affordances. The ventral stream (from V1 to IT), by contrast, is able to recognize what the object is. This information is passed to PFC, which can then, on the basis of the current goals of the organism and the recognition of the nature of the object, bias AIP to choose the affordance appropriate to the task at hand. The original FARS model posited connections between PFC and F5.
However, there is evidence (reviewed by Rizzolatti & Luppino 2001) that these connections are very limited, whereas rich connections exist between PFC and AIP. Rizzolatti and Luppino (2003) therefore suggested that FARS be modified so that information on object semantics and the goals of the individual influence AIP rather than F5 neurons. I show the modified schematic in Figure 2. The modified figure represents the way in which AIP may accept signals from areas F6 (pre-SMA), 46 (dorsolateral prefrontal cortex), and F2 (dorsal premotor cortex) to respond to task constraints, working memory, and instruction stimuli, respectively. In other words, AIP provides cues on how to interact with an object, leaving it to IT to categorize the object or determine its identity.

Figure 2. A reconceptualization of the FARS model in which the primary influence of PFC (prefrontal cortex) on the selection of affordances is on parietal cortex (AIP, anterior intraparietal sulcus) rather than premotor cortex (the hand area F5).

Although the data on cell specificity in F5 and AIP emphasize single actions, these actions are normally part of more complex behaviors. To take a simple example, a monkey who grasps a raisin will, in general, then proceed to eat it. Moreover, a particular action might be part of many learned sequences, and so we do not expect the premotor neurons for one action to prime a single possible consequent action and hence must reject "hard wiring" of the sequence. The generally adopted solution is to segregate the learning of a sequence from the circuitry which encodes the unit actions, the latter being F5 in the present study. Instead, another area (possibly the part of the supplementary motor area called pre-SMA; Rizzolatti et al. 1998) has neurons whose connections encode an "abstract sequence" Q1, Q2, Q3, Q4, with sequence learning then involving learning that the activation of Q1 triggers the F5 neurons for A, Q2 triggers B, Q3 triggers A again, and Q4 triggers C to provide encoding of the sequence A-B-A-C. Other studies suggest that administration of the sequence (inhibiting extraneous actions, while priming imminent actions) is carried out by the basal ganglia on the basis of its interactions with the pre-SMA (Bischoff-Grethe et al. 2003; see Dominey et al. 1995 for an earlier model of the possible role of the basal ganglia in sequence learning).

3.2.
Modeling development of the mirror system

The populations of canonical and mirror neurons appear to be spatially segregated in F5 (Rizzolatti & Luppino 2001). Both sectors receive a strong input from the secondary somatosensory area (SII) and parietal area PF. In addition, canonical neurons are the selective target of area AIP. Perrett et al. (1990; cf. Carey et al. 1997) found that STSa, in the rostral part of the superior temporal sulcus (STS), has neurons which discharge when the monkey observes such biological actions as walking, turning the head, bending the torso, and moving the arms. Of most relevance to us is that a few of these neurons discharged when the monkey observed goal-directed hand movements, such as grasping objects (Perrett et al. 1990), though STSa neurons do not seem to discharge during movement execution as distinct from observation. STSa and F5 may be indirectly connected via the inferior parietal area PF (Brodmann area 7b) (Cavada & Goldman-Rakic 1989; Matelli et al. 1986; Petrides & Pandya 1984; Seltzer & Pandya 1994). About 40% of the visually responsive neurons in PF are active for observation of actions such as holding, placing, reaching, grasping, and bimanual interaction. Moreover, most of these action-observation neurons were also active during the execution of actions similar to those for which they were "observers," and were therefore called PF mirror neurons (Fogassi et al. 1998).

In summary, area F5 and area PF include an observation/execution matching system: When the monkey observes an action that resembles one in its movement repertoire, a subset of the F5 and PF mirror neurons is activated which also discharges when a similar action is executed by the monkey itself.

I next develop the conceptual framework for thinking about the relation between F5, AIP, and PF. Section 6.1 expands the mirror neuron database, reviewing the reports by Kohler et al.
(2002) of a subset of mirror neurons responsive to sounds and by Ferrari et al. (2003) of neurons responsive to the observation of orofacial communicative gestures.

Figure 3 provides a glimpse of the schemas (functions) involved in the MNS model (Oztop & Arbib 2002) of the monkey mirror system (note 15). First, we look at those elements involved when the monkey itself reaches for an object. Areas IT and cIPS (caudal intraparietal sulcus; part of area 7) provide visual input concerning the nature of the observed object and the position and orientation of the object's surfaces, respectively, to AIP. The job of AIP is then to extract the affordances the object offers for grasping. The upper diagonal in Figure 3 corresponds to the basic pathway AIP → F5canonical → M1 (primary motor cortex) of the FARS model, but Figure 3 does not include the important role of PFC in action selection. The lower-right diagonal (MIP/LIP/VIP → F4) completes the "canonical" portion of the MNS model, since motor cortex must instruct not only the hand muscles how to grasp but also (via various intermediaries) the arm muscles how to reach, transporting the hand to the object. The rest of Figure 3 presents the core elements for the understanding of the mirror system. Mirror neurons do not fire when the monkey sees the hand movement or the object in isolation; it is the sight of the hand moving appropriately to grasp or otherwise manipulate a seen (or recently seen) object (Umiltà et al. 2001) that is required for the mirror neurons attuned to the given action to fire. This requires schemas both for recognizing the shape of the hand and analyzing its motion (ascribed in the figure to STS), and for analyzing the relation of these hand parameters to the location and affordance of the object (7a and 7b; we identify 7b with PF).
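The canonical pathway just described (AIP extracts affordances, F5 canonical neurons command a grasp), together with the PFC bias on AIP discussed in section 3.1, can be illustrated with a toy sketch. It is not the FARS implementation: the lookup tables, the mug example, and every function name below are hypothetical, chosen only to show how object identity and current goal might select among affordances that the dorsal stream offers without knowing what the object is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Affordance:
    part: str   # graspable part of the object
    grasp: str  # grasp type that this part affords


# Hypothetical lookup standing in for AIP's visual analysis of object parts.
GRASP_FOR_PART = {
    "handle": "power grasp",
    "rim": "precision pinch",
    "body": "side grasp",
}

# Hypothetical PFC knowledge: (object identity from IT, current goal) -> preferred part.
PFC_PREFERENCE = {
    ("mug", "drink"): "handle",
    ("mug", "hand over"): "body",
}


def aip_affordances(visible_parts):
    """Dorsal stream: list how each visible part could be grasped, using no object identity."""
    return [Affordance(p, GRASP_FOR_PART[p]) for p in visible_parts if p in GRASP_FOR_PART]


def pfc_bias(identity, goal):
    """Ventral stream plus goals: weight the part suited to the task (empty if unknown)."""
    preferred = PFC_PREFERENCE.get((identity, goal))
    return {preferred: 1.0} if preferred else {}


def select_affordance(affordances, bias):
    """AIP under PFC bias: pick the highest-weighted affordance (the first one on ties)."""
    return max(affordances, key=lambda a: bias.get(a.part, 0.0))


def f5_canonical_command(affordance):
    """F5 canonical stage: name the motor program that M1 would carry out."""
    return f"{affordance.grasp} on {affordance.part}"


mug_parts = ["rim", "handle", "body"]
options = aip_affordances(mug_parts)
print(f5_canonical_command(select_affordance(options, pfc_bias("mug", "drink"))))
print(f5_canonical_command(select_affordance(options, pfc_bias("mug", "hand over"))))
```

The same set of affordances yields different commands as the goal changes, which is the point of routing the bias through AIP. With no bias at all the sketch simply falls back to the first affordance listed; a real model would need some other default.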
In the MNS model, the hand state was accordingly defined as a vector whose components represented the movement of the wrist relative to the location of the object and the shape of the hand relative to the affordances of the object. Oztop and Arbib (2002) showed that an artificial neural network corresponding to PF and F5mirror could be trained to recognize the grasp type from the hand state trajectory, with correct classification often being achieved well before the hand reached the object. The modeling assumed that the neural equivalent of a grasp being in the monkey's repertoire is that there is a pattern of activity in the F5 canonical neurons which commands that grasp. During training, the output of the F5 canonical neurons, acting as a code for the grasp being executed by the monkey at that time, was used as the training signal for the F5 mirror neurons to enable them to learn which hand-object trajectories corresponded to the canonically encoded grasps. Moreover, the input to the F5 mirror neurons encodes the trajectory of the relation of parts of the hand to the object rather than the visual appearance of the hand in the visual field. As a result of this training, the appropriate mirror neurons come to fire in response to viewing the appropriate trajectories even when the trajectory is not accompanied by F5 canonical firing.
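The training scheme in this paragraph can be made concrete with a deliberately small sketch. It is not the network of Oztop and Arbib (2002): a nearest-prototype classifier stands in for the neural network, the hand state is cut down to two assumed components (wrist-to-object distance and hand aperture relative to the object), and the trajectories are synthetic. What it keeps is the logic of the text: the label for each self-executed trajectory comes from the canonical code, and an observed trajectory can then be classified from its early portion, with no canonical signal present.

```python
def make_trajectory(final_aperture, steps=10, offset=0.0):
    """Synthetic hand-state trajectory as (wrist-to-object distance, relative aperture) pairs."""
    trajectory = []
    for i in range(steps + 1):
        t = i / steps
        distance = 1.0 - t                                    # wrist closes in on the object
        aperture = 0.5 + (final_aperture - 0.5) * t + offset  # hand shapes toward the grasp
        trajectory.append((distance, aperture))
    return trajectory


class MirrorPrototypes:
    """Keeps one mean hand-state trajectory per grasp label (a stand-in for F5 mirror learning)."""

    def __init__(self):
        self.sums = {}    # label -> per-step running sums of [distance, aperture]
        self.counts = {}  # label -> number of training trajectories seen

    def train(self, trajectory, canonical_code):
        """Self-executed grasp: the canonical code serves as the teaching signal."""
        sums = self.sums.setdefault(canonical_code, [[0.0, 0.0] for _ in trajectory])
        for acc, (distance, aperture) in zip(sums, trajectory):
            acc[0] += distance
            acc[1] += aperture
        self.counts[canonical_code] = self.counts.get(canonical_code, 0) + 1

    def classify(self, partial_trajectory):
        """Observed grasp: return the label whose mean trajectory best matches the prefix seen so far."""
        def cost(label):
            n = self.counts[label]
            return sum((d - s[0] / n) ** 2 + (a - s[1] / n) ** 2
                       for (d, a), s in zip(partial_trajectory, self.sums[label]))
        return min(self.sums, key=cost)


mirror = MirrorPrototypes()
for offset in (-0.02, 0.02):  # two slightly different self-executed examples per grasp
    mirror.train(make_trajectory(0.2, offset=offset), "precision pinch")
    mirror.train(make_trajectory(0.9, offset=offset), "power grasp")

# Another agent's grasp, seen only up to the halfway point of the reach.
observed = make_trajectory(0.85)[:6]
print(mirror.classify(observed))
```

Because the input is the relation of hand to object rather than a retinal image, nothing in the classifier distinguishes a trajectory produced by the self from one produced by another agent.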
This training prepares the F5 mirror neurons to respond to hand-object relational trajectories even when the hand is of the "other" rather than the "self," because the hand state is based on the movement of a hand relative to the object, and thus only indirectly on the retinal input of seeing hand and object, which can differ greatly between observation of self and other. What makes the modeling worthwhile is that the trained network not only responded to hand-state trajectories from the training set, but also exhibited interesting responses to novel hand-object relationships. Despite the use of a non-physiological neural network, simulations with the model revealed a range of putative properties of mirror neurons that suggest new neurophysiological experiments. (See Oztop & Arbib for examples and detailed analysis.)

Although MNS was constructed as a model of the development of mirror neurons in the monkey, it serves equally well as a model of the development of mirror neurons in the human infant. A major theme for future modeling, then, will be to clarify which aspects of human development are generic for primates and which are specific to the human repertoire. In any case, the MNS model makes the crucial assumption that the grasps that the mirror system comes to recognize are already in the (monkey or human) infant's repertoire. But this raises the question of how grasps entered the repertoire. To simplify somewhat, the answer has two parts: (1) Children explore their environment, and as their initially inept arm and hand movements successfully contact objects, they learn to reproduce the successful grasps reliably, with the repertoire being tuned through further experience. (2) With more or less help from caregivers, infants come to recognize certain novel actions in terms of similarities with and differences from movements already in their repertoires, and on this basis learn to produce some version of these novel actions for themselves.
Our Infant Learning to Grasp Model (ILGM; Oztop et al. 2004) strongly supports the hypothesis that grasps are acquired through experience as the infant learns how to conform the biomechanics of its hand to the shapes of the objects it encounters. However, limited space precludes presentation of this model here.

The classic papers on the mirror system for grasping in the monkey focus on a repertoire of grasps, such as the precision pinch and power grasp, that seem so basic that it is tempting to think of them as prewired. The crucial point of this section on modeling is that learning models such as ILGM and MNS, and the data they address, make clear that mirror neurons are not restricted to recognition of an innate set of actions but can be recruited to recognize and encode an expanding repertoire of novel actions. I will relate the FARS and MNS models to the development of imitation at the end of section 4.

With this, let us turn to human data. We mentioned in section 1.2 that Broca's area, traditionally thought of as a speech area, has been shown by brain imaging studies to be active when humans both execute and observe grasps. This was first tested by two positron emission tomography (PET) experiments (Grafton et al. 1996; Rizzolatti et al. 1996) which compared brain activation when subjects observed the experimenter grasping an object against activation when subjects simply observed the object. Grasp observation significantly activated the superior temporal sulcus (STS), the inferior parietal lobule, and the inferior frontal gyrus (area 45). All activations were in the left hemisphere. The last area is of especial interest because areas 44 and 45 in the left hemisphere of the human constitute Broca's area.
Such data certainly contribute to the growing body of indirect evidence that there is a mirror system for grasping that links Broca's area with regions in the inferior parietal lobule and STS. We have seen that the "minimal mirror system" for grasping in the macaque includes mirror neurons in the parietal area PF (7b) as well as F5, and some not-quite-mirror neurons in the region STSa in the superior temporal sulcus. Hence, in further investigation of the mirror system hypothesis it will be crucial to extend the F5 → Broca's area homology to examine the human homologues of PF and STSa as well. I will return to this issue in section 7 (see Fig. 6) and briefly review some of the relevant data from the rich and rapidly growing literature based on human brain imaging and transcranial magnetic stimulation (TMS) inspired by the effort to probe the human mirror system and relate it to action recognition, imitation, and language.

Figure 3. A schematic view of the Mirror Neuron System (MNS) model (Oztop & Arbib 2002).

Returning to the term "language readiness," let me stress that the reliable linkage of brain areas to different aspects of language in normal speaking humans does not imply that language per se is "genetically encoded" in these regions. There is a neurology of writing even though writing was invented only a few thousand years ago. The claim is not that Broca's area, Wernicke's area, and STS are genetically preprogrammed for language, but rather that the development of a human child in a language community normally adapts these brain regions to play a crucial (but not the only) role in language performance.

4. Imitation

We have already discussed the mirror system for grasping as something shared between macaque and human; hence the hypothesis that this set of mechanisms was already in place in the common ancestor of monkey and human some 20 million years ago (note 16). In this section we move from stage S2, a mirror system for grasping, to stages S3, a simple imitation system for grasping, and S4, a complex imitation system for grasping.
I will argue that chimpanzees possess a capability for simple imitation that monkeys lack, but that humans have complex imitation whereas other primates do not.

The ability to copy single actions is just the first step towards complex imitation, which involves parsing a complex movement into more or less familiar pieces and then performing the corresponding composite of (variations on) familiar actions. Arbib and Rizzolatti (1997) asserted that what makes a movement into an action is that it is associated with a goal, and that initiation of the movement is accompanied by the creation of an expectation that the goal will be met. Hence, it is worth stressing that when I speak of imitation here, I speak of the imitation of a movement and its linkage to the goals it is meant to achieve. The action may thus vary from occasion to occasion depending on parametric variations in the goal. This is demonstrated by Byrne's (2003) description, noted earlier, of a mountain gorilla preparing bundles of nettle leaves to eat.

Visalberghi and Fragaszy (2002) review data on attempts to observe imitation in monkeys, including their own studies of capuchin monkeys. They stress the huge difference between the major role that imitation plays in learning by human children, and the very limited role, if any, that imitation plays in social learning in monkeys. There is little evidence for vocal imitation in monkeys or apes (Hauser 1996), but it is generally accepted that chimpanzees are capable of some forms of imitation (Tomasello & Call 1997).

There is not space here to analyze all the relevant distinctions between imitation and other forms of learning, but one example may clarify my view: Voelkl and Huber (2000) had marmosets observe a demonstrator removing the lids from a series of plastic canisters to obtain a mealworm.
When subsequently allowed access to the canisters, marmosets that observed a demonstrator using its hands to remove the lids used only their hands. In contrast, marmosets that observed a demonstrator using its mouth also used their mouths to remove the lids. Voelkl and Huber (2000) suggest that this may be a case of true imitation in marmosets, but I would argue that it is a case of stimulus enhancement: apparent imitation resulting from directing attention to a particular object or part of the body or environment. This is to be distinguished from emulation (observing and attempting to reproduce results of another's actions without paying attention to details of the other's behavior) and true imitation, which involves copying a novel, otherwise improbable action or some act that is outside the imitator's prior repertoire.

Myowa-Yamakoshi and Matsuzawa (1999) observed in a laboratory setting that chimpanzees typically took 12 trials to learn to "imitate" a behavior and in doing so paid more attention to where the manipulated object was being directed than to the actual movements of the demonstrator. This involves the ability to learn novel actions which may require using one or both hands to bring two objects into relationship, or to bring an object into relationship with the body.

Chimpanzees do use and make tools in the wild, with different tool traditions found in geographically separated groups of chimpanzees: Boesch and Boesch (1983) have observed chimpanzees in Tai National Park, Ivory Coast, using stone tools to crack nuts open, although Goodall has never seen chimpanzees do this in the Gombe in Tanzania. The Tai chimpanzees crack harder-shelled nuts with stone hammers and stone anvils. They live in a dense forest where suitable stones are hard to find. The stone anvils are stored in particular locations to which the chimpanzees continually return (note 17). The nut-cracking technique is not mastered until adulthood.
Tomasello (1999b) comments that, over many years of observation, Boesch observed only two possible instances in which the mother appeared to be actively attempting to instruct her child, and that even in these cases it is unclear whether the mother had the goal of helping the young chimp learn to use the tool. We may contrast the long and laborious process of acquiring the nut-cracking technique with the rapidity with which human adults can acquire novel sequences, and the crucial role of caregivers in the development of this capacity for complex imitation.

Meanwhile, reports abound of imitation in many species, including dolphins and orangutans, and even tool use in crows (Hunt & Gray 2002). Consequently, I accept that the demarcation between the capability for imitation of humans and nonhumans is problematic. Nonetheless, I still think it is fair to claim that humans can master feats of imitation beyond those possible for other primates.

The ability to imitate has clear adaptive advantage in allowing creatures to transfer skills to their offspring, and therefore could be selected for quite independently of any adaptation related to the later emergence of protolanguage. By the same token, the ability for complex imitation could provide further selective advantage unrelated to language. However, complex imitation is central to human infants both in their increasing mastery of the physical and social world and in the close coupling of this mastery to the acquisition of language (cf. Donald 1998; Arbib et al., in press). The child must go beyond simple imitation to acquire the phonological repertoire, words, and basic "assembly skills" of its language community, and this is one of the ways in which brain mechanisms supporting imitation were crucial to the emergence of language-ready Homo sapiens.

If I then assume (1) that the common ancestor of monkeys and apes had no greater imitative ability than present-day monkeys (who possess, I suggest, stimulus enhancement rather than simple imitation), and (2) that the ability for simple imitation shared by chimps and humans was also possessed by their common ancestor, but (3) that only humans possess a talent for "complex" imitation, then I have established a case for the hypothesis that extension of the mirror system from recognizing single actions to being able to copy compound actions was the key innovation in the brains of our hominid ancestors that was relevant to language.
And, more specifically, we have the hypotheses:

Stage S3 hypothesis: Brain mechanisms supporting a simple imitation system for grasping (imitation of short, novel sequences of object-directed actions through repeated exposure) developed in the 15-million-year evolution from the common ancestor of monkeys and apes to the common ancestor of apes and humans; and

Stage S4 hypothesis: Brain mechanisms supporting a complex imitation system, acquiring (longer) novel sequences of more abstract actions in a single trial, developed in the 5-million-year evolution from the common ancestor of apes and humans along the hominid line that led, in particular, to Homo sapiens (note 18).

Now that we have introduced imitation, we can put the models of section 3.2 in perspective by postulating the following stages prior to, during, and building on the development of the mirror system for grasping in the infant:
A. The child refines a crude map (superior colliculus) to make unstructured reach and "swipe" movements at objects; the grasp reflex occasionally yields a successful grasp.
B. The child develops a set of grasps which succeed by kinesthetic, somatosensory criteria (ILGM).
C. AIP develops as affordances of objects become learned in association with successful grasps. Grasping becomes visually guided; the grasp reflex disappears.
D. The (grasp) mirror neuron system develops, driven by visual stimuli relating hand and object generated by the actions (grasps) performed by the infant himself (MNS).
E. The child gains the ability to map other individuals' actions into his internal motor representation.
F. Then the child acquires the ability to imitate, creating (internal) representations for novel actions that have been observed and developing an action prediction capability.
I suggest that stages A through D are much the same in monkey and human, but that stages E and F are rudimentary at best in monkeys, somewhat developed in chimps, and well-developed in human children (but not in infants). In terms of Figure 3, we might say that if MNS were augmented to have a population of mirror neurons that could acquire population codes for observed actions not yet in the repertoire of self-actions, then in stage E the mirror neurons would provide training for the canonical neurons, reversing the information flow seen in the MNS model. Note that this raises the further possibility that the human infant may come to recognize movements that not only are not within the repertoire but which never come to be within the repertoire. In this case, the cumulative development of action recognition may proceed to increase the breadth and subtlety of the range of actions that are recognizable but cannot be performed by children.

5. From imitation to protosign

The next posited transition, from stage S4, a complex imitation system for grasping, to stage S5, protosign, a manual-based communication system, takes us from imitation for the sake of instrumental goals to imitation for the sake of communication. Each stage builds on, yet is not simply reducible to, the previous stage.

I argue that the combination of the abilities (S5a) to engage in pantomime and (S5b) to make conventional gestures to disambiguate pantomime yielded a brain which could (S5) support "protosign," a manual-based communication system that broke through the fixed repertoire of primate vocalizations to yield an open repertoire of communicative gestures.

It is important to stress that communication is about far more than grasping. To pantomime the flight of a bird, you might move your hand up and down in a way that indicates the flapping of a wing. Your pantomime uses movements of the hand (and arm and body) to imitate movement other than hand movements.
You can pantomime an object either by miming a typical action by or with the object, or by tracing out the characteristic shape of the object.

The transition to pantomime does seem to involve a genuine neurological change. Mirror neurons for grasping in the monkey will fire only if the monkey sees both the hand movement and the object to which it is directed (Umiltà et al. 2001). A grasping movement that is not made in the presence of a suitable object, or is not directed toward that object, will not elicit mirror neuron firing. By contrast, in pantomime, the observer sees the movement in isolation and infers (1) what non-hand movement is being mimicked by the hand movement, and (2) the goal or object of the action. This is an evolutionary change of key relevance to language readiness. Imitation is the generic attempt to reproduce movements performed by another, whether to master a skill or simply as part of a social interaction. By contrast, pantomime is performed with the intention of getting the observer to think of a specific action, object, or event. It is essentially communicative in its nature. The imitator observes; the pantomimic intends to be observed.

As Stokoe (2001) and others emphasize, the power of pantomime is that it provides open-ended communication that works without prior instruction or convention. However (and I shall return to this issue at the end of this section), even signs of modern signed language which resemble pantomimes are conventionalized and are, thus, distinct from pantomimes. Pantomime per se is not a form of protolanguage; rather it provides a rich scaffolding for the emergence of protosign.

All this assumes rather than provides an explanation for LR4, the transition from making praxic movements (for example, those involved in the immediate satisfaction of some appetitive or aversive goal) to making those intended by the utterer to have a particular effect on the recipient.
I tentatively offer:

The intended communication hypothesis: The ability to
imitate combines with the ability to observe the effect of such imitation on conspecifics to support a migration of closed species-specific gestures supported by other brain regions to become the core of an open class of communicative gestures.

Darwin (1872/1965) observed long ago, across a far wider range of mammalian species than just the primates, that the facial expressions of conspecifics provide valuable cues to their likely reaction to certain courses of behavior (a rich complex summarized as “emotional state”). Moreover, the F5 region contains orofacial cells as well as manual cells. This suggests a progression from control of emotional expression by systems that exclude F5 to the extension of F5’s mirror capacity for orofacial as well as manual movement (discussed below), via its posited capacity (achieved by stage S3) for simple imitation, to support the imitation of emotional expressions. This would then provide the ability to affect the behavior of others by, for example, appearing angry. This would in turn provide the evolutionary opportunity to generalize the ability of F5 activity to affect the behavior of conspecifics from species-specific vocalizations to a general ability to use the imitation of behavior (as distinct from praxic behavior itself) as a means to influence others. This in turn makes possible reciprocity by a process of backward chaining where the influence is not so much on the praxis of the other as on the exchange of information. With this, the transition described by LR4 (intended communication) has been achieved in tandem with the achievement and increasing sophistication of LR2 (symbolization).
A further critical change (labeled S5b above) emerges from the fact that in pantomime it might be hard to distinguish, for example, a movement signifying “bird” from one meaning “flying.” This inability to adequately convey shades of meaning using “natural” pantomime would favor the invention of gestures that could in some way disambiguate which of their associated meanings was intended. Note that whereas a pantomime can freely use any movement that might evoke the intended observation in the mind of the observer, a disambiguating gesture must be conventionalized.19 This use of non-pantomimic gestures requires extending the use of the mirror system to attend to an entirely new class of hand movements. However, this does not seem to require a biological change beyond that limned above for pantomime.

As pantomime begins to use hand movements to mime different degrees of freedom (as in miming the flying of a bird), a dissociation begins to emerge. The mirror system for the pantomime (based on movements of face, hand, etc.) is now different from the recognition system for the action that is pantomimed, and – as in the case of flying – the action may not even be in the human action repertoire. However, the system is still able to exploit the praxic recognition system because an animal or hominid must observe much about the environment that is relevant to its actions but is not in its own action repertoire. Nonetheless, this dissociation now underwrites the emergence of protosign – an open system of actions that are defined only by their communicative impact, not by their direct relation to praxic goals.

Protosign may lose the ability of the original pantomime to elicit a response from someone who has not seen it before. However, the price is worth paying in that the simplified form, once agreed upon by the community, allows more rapid communication with less neural effort. One may see analogies in the history of Chinese characters.
The character 山 (san) may not seem particularly pictorial, but if (following the “etymology” of Vaccari & Vaccari 1961) we see it as a simplification of a picture of three mountains, via a series of intermediate forms, then we have no trouble seeing the simplified character as meaning “mountain.”20 The important point here for our hypothesis is that although such a “picture history” may provide a valuable crutch to some learners, with sufficient practice the crutch is thrown away, and in normal reading and writing, the link between 山 and its meaning is direct, with no need to invoke an intermediate representation of the picture. In the same way, I suggest that pantomime is a valuable crutch for acquiring a modern sign language, but that even signs which resemble pantomimes are conventionalized and are thus distinct from pantomimes.21

Interestingly, Emmorey (2002, Ch. 9) discusses studies of signers using ASL which show a dissociation between the neural systems involved in sign language and those involved in conventionalized gesture and pantomime. Corina et al. (1992b) reported left-hemisphere dominance for producing ASL signs, but no laterality effect when subjects had to produce symbolic gestures (e.g., waving good-bye or thumbs-up). Other studies report patients with left-hemisphere damage who exhibited sign language impairments but well-preserved conventional gesture and pantomime. Corina et al. (1992a) described patient W.L. with damage to left-hemisphere perisylvian regions. W.L. exhibited poor sign language comprehension and production. Nonetheless, this patient could produce stretches of pantomime and tended to substitute pantomimes for signs, even when the pantomime required more complex movement. Emmorey sees such data as providing neurological evidence that signed languages consist of linguistic gestures and not simply elaborate pantomimes.
Figure 4 is based on a scheme offered by Arbib (2004) in response to Hurford’s (2004) critique of the mirror system hypothesis. Hurford makes the crucial point that we must (in the spirit of Saussure) distinguish the “sign” from the “signified.” In the figure, we distinguish the “neural representation of the sign” (top row) from the “neural representation of the signified” (bottom row). The top row of the figure makes explicit the result of the progression within the mirror system hypothesis of mirror systems for:

1. Grasping and manual praxic actions.
2. Pantomime of grasping and manual praxic actions.
3. Pantomime of actions outside the pantomimic’s own behavioral repertoire (e.g., flapping the arms to mime a flying bird).
4. Conventional gestures used to formalize and disambiguate pantomime (e.g., to distinguish “bird” from “flying”).
5. Protosign, comprising conventionalized manual (and related orofacial) communicative gestures.

However, I disagree with Hurford’s suggestion that there is a mirror system for all concepts – actions, objects, and more – which links the perception and action related to each concept.22 In schema theory (Arbib 1981; 2003), I distinguish between perceptual schemas, which determine whether a given “domain of interaction” is present in the environment and provide parameters concerning the current relationship of the organism with that domain, and motor schemas, which provide the control systems which can be coordinated to effect a wide variety of actions. Recognizing an object (an apple, say) may be linked to many different courses of action (to place the apple in one’s shopping basket; to place the apple in the bowl at home; to peel the apple; to eat the apple; to discard a rotten apple, etc.). In this list, some items are apple-specific, whereas others invoke generic schemas for reaching and grasping. Such considerations led me to separate perceptual and motor schemas – a given action may be invoked in a wide variety of circumstances; a given perception may, as part of a larger assemblage, precede many courses of action. Hence, I reject the notion of a mirror system for concepts. Only rarely (as in the case of certain basic actions such as grasp or run, or certain expressions of emotion) will the perceptual and motor schemas be integrated into a “mirror schema.” I do not see a “concept” as corresponding to one word, but rather to a graded set of activations of the schema network.

But if this is the case, does a mirror system for protosigns (and, later, for the words and utterances of a language) really yield the LR3 form of the mirror property – that what counts for the sender must count for the receiver? Actually, it yields only half of this directly: the recognition that the action of the observed protosigner is his or her version of one of the conventional gestures in the observer’s repertoire. The claim, then, is that the LR3 form of the mirror property – that which counts for the sender must count for the receiver – does not result from the evolution of the F5 mirror system in and of itself to support communicative gestures as well as praxic actions; rather, this evolution occurs within the neural context that links the execution and observation of an action to the creature’s planning of its own actions and interpretations of the actions of others (Fig. 5).
These linkages extract more or less coherent patterns from the creature’s experience of the effects of its own actions as well as the consequences of actions by others. Similarly, execution and observation of a communicative action must be linked to the creature’s planning and interpretations of communication with others in relation to the ongoing behaviors that provide the significance of the communicative gestures involved.

6. The emergence of protospeech

6.1. The path to protospeech is indirect

My claim here is that the path to protospeech is indirect, with early protosign providing a necessary scaffolding for the emergence of protospeech. I thus reject the claim that speech evolved directly as an elaboration of a closed repertoire of alarm calls and other species-specific vocalizations such as those exhibited by nonhuman primates. However, I claim neither that protosign attained the status of a full language prior to the emergence of early forms of protospeech, nor even that stage S5 (protosign) was completed before stage S6 (protospeech) began.

Manual gesture certainly appears to be more conducive to iconic representation than oral gesture. The main argument of section 5 was that the use of pantomime made it easy to acquire a core vocabulary, while the discovery of a growing stock of conventional signs (or sign modifiers) to mark important distinctions then created a culture in which the use of arbitrary gestures would increasingly augment and ritualize (without entirely supplanting) the use of pantomime.23 Once an organism has an iconic gesture, it can modulate that gesture and/or symbolize it (non-iconically) by “simply” associating a vocalization with it.
Once the association had been learned, the “scaffolding” gesture (like the pantomime that supported its conventionalization, or the caricature that supports the initial understanding of some Chinese ideograms) could be dropped to leave a symbol that need have no remaining iconic relation to its referent, even if the indirect associative relationship can be recalled on some occasions. One open question is the extent to which protosign must be in place before this scaffolding can effectively support the development of protospeech. Because there is no direct mapping of sign (with its use of concurrency and signing space) to phoneme sequences, I think that this development is far more of a breakthrough than it may at first sight appear.

Figure 4. The bidirectional sign relation links words and concepts. The top row concerns Phonological Form, which may relate to signed language as much as to spoken language. The bottom row concerns Cognitive Form and includes the recognition of objects and actions. Phonological Form is present only in humans while Cognitive Form is present in both monkeys and humans. The Mirror System Hypothesis hypothesizes that there is a mirror system for words, but there may not be a mirror system for concepts.

Figure 5. The perceptuomotor coding for both observation and execution contained in the mirror system for manual actions in the monkey is linked to “conceptual systems” for interpretation and planning of such actions. The interpretation and planning systems themselves do not have the mirror property save through their linkage to the actual mirror system.

I have separated S6, the evolution of protospeech, from S5, the evolution of protosign, to stress the point that the role of F5 in grounding the evolution of a protolanguage
system would work just as well if we and all our ancestors had been deaf. However, primates do have a rich auditory system which contributes to species survival in many ways, of which communication is just one (Ghazanfar 2003). The protolanguage perception system could thus build upon the existing auditory mechanisms in the move to derive protospeech. However, it appears that considerable evolution of the vocal-motor system was needed to yield the flexible vocal apparatus that distinguishes humans from other primates. MacNeilage (1998) offers an argument for how the mechanism for producing consonant-vowel alternations en route to a flexible repertoire of syllables might have evolved from the cyclic mandibular alternations of eating, but offers no clue as to what might have linked such a process to the expression of meaning (but see MacNeilage & Davis, in press b). This problem is discussed at greater length in Arbib (2005), which spells out how protosign (S5) may have provided a scaffolding for protospeech (S6), forming an “expanding spiral” wherein the two interacted with each other in supporting the evolution of brain and body that made Homo sapiens “language-ready” in a multi-modal integration of manual, facial and vocal actions.

New data on mirror neurons for grasping that exhibit auditory responses, and on mirror-like properties of orofacial neurons in F5, add to the subtlety of the argument. Kohler et al. (2002) studied mirror neurons for actions which are accompanied by characteristic sounds, and found that a subset of these neurons is activated by the sound of the action (e.g., breaking a peanut in half) as well as sight of the action. Does this suggest that protospeech mediated by the F5 homologue in the hominid brain could have evolved without the scaffolding provided by protosign?
My answer is negative for two reasons: (1) I have argued that imitation is crucial to grounding pantomime in which a movement is performed in the absence of the object for which such a movement would constitute part of a praxic action. However, the sounds studied by Kohler et al. (2002) cannot be created in the absence of the object, and there is no evidence that monkeys can use their vocal apparatus to mimic the sounds they have heard. (2) I would further argue that the limited number and congruence of these “auditory mirror neurons” is more consistent with the view that manual gesture is primary in the early stages of the evolution of language readiness, with audiomotor neurons laying the basis for later extension of protosign to protospeech.

Complementing earlier studies on hand neurons in macaque F5, Ferrari et al. (2003) studied mouth motor neurons in F5 and showed that about one-third of them also discharge when the monkey observes another individual performing mouth actions. The majority of these “mouth mirror neurons” become active during the execution and observation of mouth actions related to ingestive functions such as grasping, sucking, or breaking food. Another population of mouth mirror neurons also discharges during the execution of ingestive actions, but the most effective visual stimuli in triggering them are communicative mouth gestures (e.g., lip-smacking) – one action becomes associated with a whole performance of which one part involves similar movements. This fits with the hypothesis that neurons learn to associate patterns of neural firing rather than being committed to learn specifically pigeonholed categories of data. Thus, a potential mirror neuron is in no way committed to become a mirror neuron in the strict sense, even though it may be more likely to do so than otherwise.
The observed communicative actions (with the effective executed action for different “mirror neurons” in parentheses) include lip-smacking (sucking and lip-smacking); lips protrusion (grasping with lips, lips protrusion, lip-smacking, grasping, and chewing); tongue protrusion (reaching with tongue); teeth-chatter (grasping); and lips/tongue protrusion (grasping with lips and reaching with tongue; grasping). We therefore see that the communicative gestures and their associated effective observed actions are a long way from the sort of vocalizations that occur in speech (see Fogassi & Ferrari [in press] for further discussion).

Rizzolatti and Arbib (1998) stated that “This new use of vocalization [in speech] necessitated its skillful control, a requirement that could not be fulfilled by the ancient emotional vocalization centers. This new situation was most likely the ‘cause’ of the emergence of human Broca’s area.” I would now rather say that Homo habilis and even more so Homo erectus had a “proto-Broca’s area” based on an F5-like precursor mediating communication by manual and orofacial gestures, which made possible a process of collateralization whereby this “proto” Broca’s area gained primitive control of the vocal machinery, thus yielding increased skill and openness in vocalization, moving from the fixed repertoire of primate vocalizations to the unlimited (open) range of vocalizations exploited in speech. Speech apparatus and brain regions could then coevolve to yield the configuration seen in modern Homo sapiens.
Corballis (2003b) argues that there may have been a single-gene mutation producing a “dextral” allele, which created a strong bias toward right-handedness and left-cerebral dominance for language at some point in hominid evolution.24 He then suggests that the “speciation event” that distinguished Homo sapiens from other large-brained hominids may have been a switch from a predominantly gestural to a predominantly vocal form of language. By contrast, I would argue that there was no one distinctive speciation event, and that the process whereby communication for most humans became predominantly vocal was not a switch but was “cultural” and cumulative.

7. The inventions of languages

The divergence of the Romance languages from Latin took about one thousand years. The divergence of the Indo-European languages to form the immense diversity of Hindi, German, Italian, English, and so on took about 6,000 years (Dixon 1997). How can we imagine what has changed since the emergence of Homo sapiens some 200,000 years ago? Or in 5,000,000 years of prior hominid evolution? I claim that the first Homo sapiens were language-ready but did not have language in the modern sense. Rather, my hypothesis is that stage S7, the transition from protolanguage to language, is the culmination of manifold discoveries in the history of mankind.

In section 2, I asserted that in much of protolanguage, a complete communicative act involved a unitary utterance, the use of a single symbol formed as a sequence of gestures, whose component gestures – whether manual or vocal – had no independent meaning. Unitary utterances such as “grooflook” or “koomzash” might have encoded quite complex descriptions such as “The alpha male has killed a meat animal and now the tribe has a chance to feast together.
Yum, yum!” or commands such as “Take your spear and go
around the other side of that animal and we will have a better chance together of being able to kill it.” On this view, “protolanguage” grew by adding arbitrary novel unitary utterances to convey complex but frequently important situations, and it was a major later discovery en route to language as we now understand it that one could gain expressive power by fractionating such utterances into shorter utterances conveying components of the scene or command (cf. Wray 1998; 2000). Put differently, the utterances of prelanguage were more akin to the “calls” of modern primates – such as the “leopard call” of the vervet monkey, which is emitted by a monkey who has seen a leopard and which triggers the appropriate escape behavior in other monkeys – than to sentences as defined in a language like English, but they differed crucially from the primate calls in that new utterances could be invented and acquired through learning within a community, rather than emerging only through biological evolution. Thus, the set of such unitary utterances was open, whereas the set of calls was closed.

The following hypothetical but instructive example is similar to examples offered at greater length by Wray (1998; 2000) to suggest how the fractionation of unitary utterances might occur (and see Kirby for a related computer simulation): Imagine that a tribe has two unitary utterances concerning fire which, by chance, contain similar substrings which become regularized so that for the first time there is a sign for “fire.” Now the two original utterances are modified by replacing the similar substrings by the new regularized substring.
Eventually, some tribe members regularize the complementary gestures in the first string to get a sign for “burns”; later, others regularize the complementary gestures in the second string to get a sign for “cooks meat.” However, because of the arbitrary origin of the sign for “fire,” the placement of the gestures that have come to denote “burns” relative to “fire” differs greatly from those for “cooks meat” relative to “fire.” It therefore requires a further invention to regularize the placement of the gestures in both utterances – and in the process, words are crystallized at the same time as the protosyntax that combines them. Clearly, such fractionation could apply to protosign as well as to protospeech.

However, fractionation is not the only mechanism that could produce composite structures. For example, a tribe might over the generations develop different signs for “sour apple,” “ripe apple,” “sour plum,” “ripe plum,” and so on, but not have signs for “sour” and “ripe” even though the distinction is behaviorally important. Hence, 2n signs are needed to name n kinds of fruit. Occasionally someone will eat a piece of sour fruit by mistake and make a characteristic face and intake of breath when doing so. Eventually, some genius pioneers the innovation of getting a conventionalized variant of this gesture accepted as the sign for “sour” by the community, to be used as a warning before eating the fruit, thus extending the protolanguage.25 A step towards language is taken when another genius gets people to use the sign for “sour” plus the sign for “ripe X” to replace the sign for “sour X” for each kind X of fruit. This innovation allows new users of the protolanguage to simplify learning fruit names, since now only n + 1 names are required for the basic vocabulary, rather than 2n as before.
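The size of this saving can be checked with a minimal sketch. The Python below is illustrative only: the fruit list and both function names are assumptions introduced here, not part of the account itself. It builds the holistic vocabulary (one unanalyzed sign per sour or ripe fruit) and the vocabulary after the innovation (one sign per ripe fruit plus a single reusable sign for “sour”).

```python
def holistic_vocabulary(fruits):
    """One unanalyzed sign per (quality, fruit) pair: 2n signs for n fruits."""
    return {f"{quality} {fruit}" for fruit in fruits for quality in ("sour", "ripe")}


def compositional_vocabulary(fruits):
    """One sign per ripe fruit plus a single sign for 'sour': n + 1 signs."""
    return {f"ripe {fruit}" for fruit in fruits} | {"sour"}


# Hypothetical inventory of n = 4 kinds of fruit.
fruits = ["apple", "plum", "pear", "fig"]

before = len(holistic_vocabulary(fruits))      # 2n signs
after = len(compositional_vocabulary(fruits))  # n + 1 signs

# Cost of adding a newly discovered fruit under each scheme.
added_before = len(holistic_vocabulary(fruits + ["quince"])) - before
added_after = len(compositional_vocabulary(fruits + ["quince"])) - after
```

Under these assumptions the gap between the two vocabularies widens by one sign with every additional kind of fruit.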
More to the point, if a new fruit is discovered, only one name need be invented rather than two. I stress that the invention of “sour” is a great discovery in and of itself. It might take hundreds of such discoveries distributed across centuries or more before someone could recognize the commonality across all these constructions and thus invent the precursor of what we would now call adjectives.26

The latter example is meant to indicate how a sign for “sour” could be added to the protolanguage vocabulary with no appeal to an underlying “adjective mechanism.” Instead, one would posit that the features of language emerged by bricolage (tinkering) which added many features as “patches” to a protolanguage, with general “rules” emerging both consciously and unconsciously only as generalizations could be imposed upon, or discerned in, a population of ad hoc mechanisms. Such generalizations amplified the power of groups of inventions by unifying them to provide expressive tools of greatly extended range. According to this account, there was no sudden transition from unitary utterances to an elaborate language with a rich syntax and compositional semantics; no point at which one could say of a tribe “Until now they used protolanguage but henceforth they use language.”

To proceed further, I need to distinguish two “readings” of a case frame like Grasp(Leo, raisin), as an action-object frame and as a verb-argument structure. I chart the transition as follows:

(1) As an action-object frame, Grasp(Leo, raisin) represents the perception that Leo is grasping a raisin. Here the action “grasp” involves two “objects,” one the “grasper” Leo and the other the “graspee,” the “raisin.” Clearly the monkey has the perceptual capability to recognize such a situation27 and enter a brain state that represents it, with that representation distributed across a number of brain regions.
Indeed, in introducing principle LR5 (from hierarchical structuring to temporal ordering) I noted that the ability to translate a hierarchical conceptual structure into a temporally ordered structure of actions is apparent whenever an animal takes in the nature of a visual scene and produces appropriate behavior. But to have such a capability does not entail the ability to communicate in a way that reflects these structures. It is also crucial to note here the importance of recognition not only of the action (mediated by F5) but also of the object (mediated by IT). Indeed, Figure 2 (the FARS model) showed that the canonical activity of F5 already exhibits a choice between the affordances of an object (mediated by the dorsal stream) that involves the nature of the object (as recognized by IT and elaborated upon in PFC in a process of “action-oriented perception”). In the same way, the activity of mirror neurons does not rest solely upon the parietal recognition (in PF, Fig. 3) of the hand motion and the object’s affordances (AIP) but also on the “semantics” of the object as extracted by IT. In the spirit of Figure 2, I suggest that this semantics is relayed via PFC and thence through AIP and PF to F5 to affect there the mirror neurons as well as the canonical neurons.

(2) My suggestion is that at least the immediate hominid precursors of Homo sapiens would have been able to perceive a large variety of action-object frames and, for many of these, to form a distinctive gesture or vocalization to appropriately direct the attention of another tribe member, but that the vocalization used would be in general a unitary utterance which need not have involved separate lexical entries for the action or the objects. However, the ability to symbolize more and more situations would have required the creation of a “symbol tool kit” of meaningless elements28 from which an open-ended class of symbols could be generated.
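The contrast drawn in (1) and (2) can be made concrete with a small sketch. The Python below is an illustrative assumption, not a model proposed in the article: the `Frame` type, the element inventory `ELEMENTS`, and the function names are all hypothetical. It represents Grasp(Leo, raisin) as an action-object frame, assigns each whole frame a unitary utterance with no separate entry for the action or the objects, and counts how many distinct symbols a small kit of meaningless elements can generate.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Frame:
    """An action-object frame such as Grasp(Leo, raisin)."""
    action: str
    agent: str
    patient: str


# Hypothetical kit of meaningless elements (stand-ins for gestures or sounds).
ELEMENTS = ("ka", "mu", "zo")


def symbols(max_len):
    """Yield every sequence of 1..max_len elements, joined into one symbol."""
    for length in range(1, max_len + 1):
        for combo in product(ELEMENTS, repeat=length):
            yield "".join(combo)


def assign_unitary(frames, max_len=3):
    """Pair each whole frame with its own symbol; symbol parts carry no meaning.

    zip stops at the shorter input, so frames beyond the kit's capacity
    would simply go unnamed in this sketch.
    """
    return dict(zip(frames, symbols(max_len)))


frames = [
    Frame("grasp", "Leo", "raisin"),
    Frame("grasp", "Leo", "apple"),
    Frame("eat", "Leo", "raisin"),
]
lexicon = assign_unitary(frames)

# Three elements combined in sequences of up to three: 3 + 9 + 27 symbols.
kit_size = sum(1 for _ in symbols(3))
```

In this toy lexicon the two “grasp” frames receive unrelated symbols, which is the sense in which a unitary utterance has no separate lexical entries; the kit grows geometrically with sequence length, which is one way to read the claim that a small stock of meaningless elements can support an open-ended class of symbols.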