Universal Grammar, statistics or both?
- PubMed: 15450509
Abstract
Recent demonstrations of statistical learning in infants have reinvigorated the innateness versus learning debate in language acquisition. This article addresses these issues from both computational and developmental perspectives. First, I argue that statistical learning using transitional probabilities cannot reliably segment words when scaled to a realistic setting (e.g. child-directed English). To be successful, it must be constrained by knowledge of phonological structure. Then, turning to the bona fide theory of innateness – the Principles and Parameters framework – I argue that a full explanation of children's grammar development must abandon the domain-specific learning model of triggering, in favor of probabilistic learning mechanisms that might be domain-general but nevertheless operate in the domain-specific space of syntactic parameters.
Opinion TRENDS in Cognitive Sciences Vol.8 No.10 October 2004
Charles D. Yang
Department of Linguistics and Psychology, Yale University, 370 Temple Street 302, New Haven, CT 06511, USA
Corresponding author: Charles D. Yang (charles.yang@yale.edu).
www.sciencedirect.com 1364-6613/$ - see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.tics.2004.08.006
Two facts about language learning are indisputable. First, only a human baby, but not her pet kitten, can learn a language. It is clear, then, that there must be some element in our biology that accounts for this unique ability. Chomsky’s Universal Grammar (UG), an innate form of knowledge specific to language, is a concrete theory of what this ability is. This position gains support from formal learning theory [1–3], which sharpens the logical conclusion [4,5] that no (realistically efficient) learning is possible without a priori restrictions on the learning space. Second, it is also clear that no matter how much of a head start the child gains through UG, language is learned. Phonology, lexicon and grammar, although governed by universal principles and constraints, do vary from language to language, and they must be learned on the basis of linguistic experience. In other words, it is a truism that both endowment and learning contribute to language acquisition, the result of which is an extremely sophisticated body of linguistic knowledge. Consequently, both must be taken into account, explicitly, in a theory of language acquisition [6,7].
Contributions of endowment and learning
Controversies arise when it comes to the relative contributions from innate knowledge and experience-based learning. Some researchers, in particular linguists, approach language acquisition by characterizing the scope and limits of innate principles of Universal Grammar that govern the world’s languages. Others, in particular psychologists, tend to emphasize the role of experience and the child’s domain-general learning ability. Such a disparity in research agenda stems from the division of labor between endowment and learning: plainly, things that are built in need not be learned, and things that can be garnered from experience need not be built in.
The influential work of Saffran, Aslin and Newport [8] on statistical learning (SL) suggests that children are powerful learners. Very young infants can exploit transitional probabilities between syllables for the task of word segmentation, with only minimal exposure to an artificial language. Subsequent work has demonstrated SL in other domains including artificial grammar, music and vision, as well as SL in other species [9–12]. Therefore, language learning is possible as an alternative or addition to the innate endowment of linguistic knowledge [13].
This article discusses the endowment versus learning debate, with special attention to both formal and developmental issues in child language acquisition. The first part argues that the SL of Saffran et al. cannot reliably segment words when scaled to a realistic setting (e.g. child-directed English). Its application and effectiveness must be constrained by knowledge of phonological structure. The second part turns to the bona fide theory of UG – the Principles and Parameters (P&P) framework [14,15]. It is argued that an adequate explanation of children’s grammar must abandon the domain-specific learning models such as triggering [16,17] in favor of probabilistic learning mechanisms that may well be domain-general.
Statistics with UG
It has been suggested [8,18] that word segmentation from continuous speech might be achieved by using transitional probabilities (TP) between adjacent syllables A and B, where $\mathrm{TP}(A \to B) = P(AB)/P(A)$, with $P(AB)$ the frequency of B following A and $P(A)$ the total frequency of A. Word boundaries are postulated at ‘local minima’, where the TP is lower than its neighbors. For example, given sufficient exposure to English, the learner might be able to establish that, in the four-syllable sequence ‘prettybaby’, $\mathrm{TP}(\text{pre} \to \text{tty})$ and $\mathrm{TP}(\text{ba} \to \text{by})$ are both higher than $\mathrm{TP}(\text{tty} \to \text{ba})$. Therefore, a word boundary between pretty and baby is correctly postulated. It is remarkable that 8-month-old infants can in fact extract three-syllable words in the continuous speech of an artificial language from only two minutes of exposure [8]. Let us call this SL model using local minima, SLM.
Statistics does not refute UG
To be effective, a learning algorithm – or any algorithm – must have an appropriate representation of the relevant (learning) data. We therefore need to be cautious about the [...] themselves stress [19]. If anything, it seems that their results [8] strengthen, rather than weaken, the case for innate linguistic knowledge.
A classic argument for innateness [4,5,20] comes from the fact that syntactic operations are defined over specific types of representations – constituents and phrases – but not over, say, linear strings of words, or numerous other logical possibilities. Although infants seem to keep track of statistical information, any conclusion drawn from such findings must presuppose that children know what kind of statistical information to keep track of. After all, an infinite range of statistical correlations exists: for example, What is the probability of a syllable rhyming with the next? What is the probability of two adjacent vowels being both nasal? The fact that infants can use SLM at all entails that, at minimum, they know the relevant unit of information over which correlative statistics is gathered; in this case, it is the syllable, rather than segments, or front vowels, or labial consonants.
A host of questions then arises. First, how do infants know to pay attention to syllables? It is at least plausible that the primacy of syllables as the basic unit of speech is innately available, as suggested by speech perception studies in neonates [21]. Second, where do the syllables come from? Although the experiments of Saffran et al. [8] used uniformly consonant–vowel (CV) syllables in an artificial language, real-world languages, including English, make use of a far more diverse range of syllabic types. Third, syllabification of speech is far from trivial, involving both innate knowledge of phonological structures as well as discovering language-specific instantiations [22]. All these problems must be resolved before SLM can take place.
Statistics requires UG
To give a precise evaluation of SLM in a realistic setting, we constructed a series of computational models tested on speech spoken to children (‘child-directed English’) [23,24] (see Box 1). It must be noted that our evaluation focuses on the SLM model, by far the most influential work in the SL tradition; its success or failure may or may not carry over to other SL models.
Box 1. Modeling word segmentation
The learning data consists of a random sample of child-directed English sentences from the CHILDES database [25]. The words were then phonetically transcribed using the Carnegie Mellon Pronunciation Dictionary, and were then grouped into syllables. Spaces between words are removed; however, utterance breaks are available to the modeled learner. Altogether, there are 226 178 words, consisting of 263 660 syllables.
Implementing statistical learning using local minima (SLM) [8] is straightforward. Pairwise transitional probabilities are gathered from the training data, which are then used to identify local minima and postulate word boundaries in the on-line processing of syllable sequences. Scoring is done for each utterance and then averaged. Viewed as an information retrieval problem, it is customary [26] to report both precision and recall of the performance.
$\text{Precision} = \dfrac{\text{No. of correct words}}{\text{No. of all words extracted by SLM}}$
$\text{Recall} = \dfrac{\text{No. of words correctly extracted by SLM}}{\text{No. of actual words}}$
For example, if the target is ‘in the park’, and the model conjectures ‘inthe park’, then precision is 1/2, and recall is 1/3.
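The procedure that Box 1 describes (gather pairwise transitional probabilities, cut at local minima, score by precision and recall) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the model evaluated in the article: the toy corpus is invented, the names `train_tp`, `segment_slm`, `spans` and `precision_recall` are hypothetical, unseen syllable pairs are given a TP of 0, and only interior positions (those with a TP on both sides) are tested for being a local minimum.

```python
from collections import Counter


def train_tp(utterances):
    """Estimate TP(A -> B) = count(A followed by B) / count(A)."""
    pair_counts, syll_counts = Counter(), Counter()
    for syllables in utterances:
        syll_counts.update(syllables)
        pair_counts.update(zip(syllables, syllables[1:]))
    return {(a, b): n / syll_counts[a] for (a, b), n in pair_counts.items()}


def segment_slm(syllables, tp, unseen=0.0):
    """Cut after syllable i when TP(i -> i+1) is a strict interior local minimum."""
    tps = [tp.get(pair, unseen) for pair in zip(syllables, syllables[1:])]
    words, start = [], 0
    for i in range(1, len(tps) - 1):
        if tps[i] < tps[i - 1] and tps[i] < tps[i + 1]:
            words.append(tuple(syllables[start:i + 1]))
            start = i + 1
    words.append(tuple(syllables[start:]))
    return words


def spans(words):
    """Map a segmentation to the set of (start, end) syllable spans it covers."""
    out, pos = set(), 0
    for word in words:
        out.add((pos, pos + len(word)))
        pos += len(word)
    return out


def precision_recall(predicted, target):
    """Score one utterance: a predicted word counts only if its span matches."""
    hits = len(spans(predicted) & spans(target))
    return hits / len(predicted), hits / len(target)


# Hypothetical toy corpus of syllabified utterances (assumption: not CHILDES data).
corpus = [
    ["pre", "tty", "ba", "by"],
    ["pre", "tty", "do", "ggy"],
    ["big", "ba", "by"],
]
tp = train_tp(corpus)
print(segment_slm(["pre", "tty", "ba", "by"], tp))

# Box 1's scoring example: target 'in the park', conjecture 'inthe park'.
print(precision_recall([("in", "the"), ("park",)], [("in",), ("the",), ("park",)]))
```

On the toy corpus the TP from "tty" to "ba" is lower than both of its neighbors, so the sketch cuts between "pretty" and "baby". Following Box 1, a full evaluation would compute the two scores per utterance and then average them.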
The segmentation results using SLM are poor, even assuming that the learner has already syllabified the input perfectly. Precision is 41.6%, and recall is 23.3% (using the definitions in Box 1); that is, over half of the words extracted by the model are not actual words, and close to 80% of actual words are not extracted. This is unsurprising, however. In order for SLM to be usable, a TP at an actual word boundary must be lower than its neighbors. Obviously, this condition cannot be met if the input is a sequence of monosyllabic words, for which a space must be postulated for every syllable; there are no local minima to speak of. Whereas the pseudowords in Saffran et al. [8] are uniformly three syllables long, much of child-directed English consists of sequences of monosyllabic words: on average, a monosyllabic word is followed by another monosyllabic word 85% of the time. As long as this is the case, SLM cannot work in principle.
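The in-principle limit on monosyllabic sequences can be made concrete with a small Python sketch. It assumes a strict reading of the rule (a TP counts as a local minimum only if it is lower than both neighboring TPs), under which two adjacent positions can never both be minima. The names `local_minima` and `boundary_ceiling` are hypothetical, and the random TP values are purely illustrative.

```python
import random


def local_minima(tps):
    """Indices whose TP is strictly lower than both neighboring TPs."""
    return [i for i in range(1, len(tps) - 1)
            if tps[i] < tps[i - 1] and tps[i] < tps[i + 1]]


def boundary_ceiling(n_syllables):
    """Most boundaries a strict interior local-minimum rule can ever posit."""
    interior_gaps = max(n_syllables - 3, 0)  # gaps that have a TP on both sides
    return (interior_gaps + 1) // 2          # minima can never be adjacent


n = 6              # e.g. six monosyllabic words in a row
needed = n - 1     # every gap between syllables is a word boundary
rng = random.Random(0)
most_found = max(
    len(local_minima([rng.random() for _ in range(n - 1)]))
    for _ in range(10_000)
)
print(needed, boundary_ceiling(n), most_found)
```

Whatever the TP values are, the number of boundaries found stays at or below the ceiling, which is well short of the number needed when every syllable is its own word.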
Yet this remarkable ability of infants to use SLM could still be effective for word segmentation. It must be constrained – like any learning algorithm, however powerful – as suggested by formal learning theories [1–3]. Its performance improves dramatically if the learner is equipped with even a small amount of prior knowledge about phonological structures. To be specific, we assume, uncontroversially, that each word can have only one primary stress. (This would not work for a small and closed set of functional words, however.) If the learner knows this, then ‘bigbadwolf’ breaks into three words for free. The learner turns to SLM only when stress information fails to establish a word boundary; that is, it limits the search for local minima to the window between two syllables that both bear primary stress, for example, between the two a’s in the sequence ‘languageacquisition’. This is plausible given that 7.5-month-old infants are sensitive to strong/weak patterns (as in fallen) of prosody [22]. Once such a structural principle is built in, the stress-delimited SLM can achieve precision of 73.5% and recall of 71.2%, which compare favorably to the best performance previously reported in the literature [26]. (That work, however, uses a computationally prohibitive algorithm that iteratively optimizes the entire lexicon.)
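A stress-delimited variant along these lines might look like the following Python sketch. It is one possible reading rather than the article's implementation: the function name `segment_stress_slm`, the stress annotations and the TP values are hypothetical, and "searching for a local minimum in the window" is simplified to taking the single lowest TP between two consecutive primary stresses.

```python
def segment_stress_slm(syllables, stressed, tp, unseen=0.0):
    """Place exactly one boundary between each pair of consecutive primary stresses.

    syllables: list of syllable strings for one utterance
    stressed:  parallel list of bools, True where the syllable bears primary stress
    tp:        dict mapping (A, B) syllable pairs to transitional probabilities
    """
    stress_idx = [i for i, s in enumerate(stressed) if s]
    cuts = []
    for left, right in zip(stress_idx, stress_idx[1:]):
        # Gap g lies between syllables[g] and syllables[g + 1].
        # With adjacent stresses the window holds one gap, so no TP is needed.
        gaps = range(left, right)
        best = min(gaps, key=lambda g: tp.get((syllables[g], syllables[g + 1]), unseen))
        cuts.append(best + 1)
    words, start = [], 0
    for cut in cuts:
        words.append(tuple(syllables[start:cut]))
        start = cut
    words.append(tuple(syllables[start:]))
    return words


# 'bigbadwolf': three adjacent primary stresses, so stress alone segments it.
print(segment_stress_slm(["big", "bad", "wolf"], [True, True, True], {}))

# Hypothetical TPs for 'prettybaby' (PRE-tty BA-by); values are illustrative only.
toy_tp = {("pre", "tty"): 1.0, ("tty", "ba"): 0.5, ("ba", "by"): 1.0}
print(segment_stress_slm(["pre", "tty", "ba", "by"], [True, False, True, False], toy_tp))
```

Because the one-stress-per-word assumption guarantees a boundary somewhere in each window, the sketch never has to decide whether to cut, only where. The caveat about unstressed function words applies to it unchanged.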
Modeling results complement experimental findings that prosodic information takes priority over statistical information when both are available [27], and are in the same vein as recent work [28] on when and where SL is effective or possible. Again, though, one needs to be cautious about the improved performance, and several unresolved issues need to be addressed by future work (see Box 2). It remains possible that SLM is not used at all in actual word segmentation. Once the one-word/one-stress principle is built in, we can consider a model that uses no SL, hence avoiding the probably considerable computational cost. (We don’t know how infants keep track of TPs, but it is certainly non-trivial. English has thousands of syllables; now take the quadratic for the number of pair-wise TPs.) Such a model simply stores previously extracted words in memory to bootstrap new words. Young children’s familiar segmentation errors (e.g. ‘I was have’ from be-have, ‘hiccing up’ from hicc-up, ‘two dults’,