Empirical Tests of the Gradual Learning Algorithm*

Paul Boersma, University of Amsterdam
Bruce Hayes, UCLA

September 29, 1999

Abstract

The Gradual Learning Algorithm (Boersma 1997) is a constraint ranking algorithm for learning Optimality-theoretic grammars. The purpose of this article is to assess the capabilities of the Gradual Learning Algorithm, particularly in comparison with the Constraint Demotion algorithm of Tesar and Smolensky (1993, 1996, 1998), which initiated the learnability research program for Optimality Theory. We argue that the Gradual Learning Algorithm has a number of special advantages: it can learn free variation, avoid failure when confronted with noisy learning data, and account for gradient well-formedness judgments. The case studies we examine involve Ilokano reduplication and metathesis, Finnish genitive plurals, and the distribution of English light and dark /l/.

1 Introduction

Optimality Theory (Prince and Smolensky 1993) has made possible a new and fruitful approach to the problem of phonological learning. If the language learner has access to an appropriate inventory of constraints, then a complete grammar can be derived, provided there is an algorithm available that can rank the constraints on the basis of the input data. This possibility has led to a line of research on ranking algorithms, originating with the work of Tesar and Smolensky (1993, 1996, 1998; Tesar 1995), who propose an algorithm called Constraint Demotion, reviewed below. Other work on ranking algorithms includes Pulleyblank and Turkel (1995, 1996, 1998, to appear), Broihier (1995), and Hayes (1999). Our focus here is the Gradual Learning Algorithm, as developed by Boersma (1997, 1998, to appear).
This algorithm is in some respects a development of Tesar and Smolensky's proposal: it directly perturbs constraint rankings in response to language data, and, like most previously proposed algorithms, it is error-driven, in that it alters rankings only when the input data conflict with its current ranking hypothesis. What is different about the Gradual Learning Algorithm is the type of Optimality-theoretic grammar it presupposes: rather than a set of discrete rankings, it assumes a continuous scale of constraint strictness. Also, the grammar is regarded as stochastic: at every evaluation of the candidate set, a small noise component is temporarily added to the ranking value of each constraint, so that the grammar can produce variable outputs if some constraint rankings are close to each other.

* We would like to thank Arto Anttila for helpful input in the preparation of this paper. Thanks also to Louis Pols, the University of Utrecht, and the UCLA Academic Senate for material assistance in making our joint work possible. The work of the first author was supported by a grant from the Netherlands Organization for Scientific Research.
The continuous ranking scale implies a different response to input data: rather than a wholesale reranking, the Gradual Learning Algorithm executes only small perturbations to the constraints' locations along the scale. We argue that this more conservative approach yields important advantages in three areas. First, the Gradual Learning Algorithm can fluently handle optionality: it readily forms grammars that can generate multiple outputs. Second, the algorithm is robust, in the sense that speech errors occurring in the input data do not lead it off course. Third, the algorithm is capable of developing formal analyses of linguistic phenomena in which speakers' judgments involve intermediate well-formedness.

A paradoxical aspect of the Gradual Learning Algorithm is that, even though it is statistical and gradient in character, most of the constraint rankings it learns are (for all practical purposes) categorical. These categorical rankings emerge as the limit of gradual learning. Categorical rankings are of course crucial for learning data patterns where there is no optionality.

Learning algorithms can be assessed on both theoretical and empirical grounds. At the purely theoretical level, we want to know if an algorithm can be guaranteed to learn all grammars that possess the formal properties it presupposes. Research results on this question as it concerns the Gradual Learning Algorithm are reported in Boersma (1997, 1998, to appear). On the empirical side, we need to show that natural languages are indeed appropriately analyzed with grammars of the formal type the algorithm can learn. This paper focuses on the second of these two tasks. We confront the Gradual Learning Algorithm with a variety of representative phonological phenomena, in order to assess its capabilities in various ways.
This approach reflects our belief that learning algorithms can be tested just like other proposals in linguistic theory, by checking them out against language data. A number of our data examples are taken from the work of the second author, who arrived independently at the notion of a continuous ranking scale, and has with colleagues developed a number of hand-crafted grammars that work on this basis (Hayes and MacEachern 1998; Hayes, to appear).

We will begin by reviewing how the Gradual Learning Algorithm works, then present several empirical applications. A study of Ilokano phonology shows how the algorithm can cope with data involving systematic optionality. We also use a restricted subset of the Ilokano data to simulate the response of the algorithm to speech errors. In both cases, we make comparisons with the behavior of the Constraint Demotion Algorithm. We next turn to the study of output frequencies, posed as an additional, stringent empirical test of the Gradual Learning Algorithm. We use the algorithm to replicate the study of Anttila (1997a,b) on Finnish genitive plurals. Lastly we turn to gradient well-formedness, showing that the algorithm can replicate the results on English /l/ derived with a hand-crafted grammar by Hayes (to appear).

2 How the Gradual Learning Algorithm Works

Two concepts crucial to the Gradual Learning Algorithm are the continuous ranking scale and stochastic candidate evaluation. We cover these first, then turn to the internal workings of the algorithm.

2.1 The Continuous Ranking Scale

The algorithm presupposes a linear scale of constraint strictness, in which higher values correspond to higher-ranked constraints. The scale is arranged in arbitrary units, and in principle
has no upper or lower bound. Other work that has suggested or adopted a continuous scale includes Liberman (1993:21, cited in Reynolds 1994), Zubritskaya (1997:142-4), Hayes and MacEachern (1998), and Hayes (to appear). Continuous scales include strict constraint ranking as a special case. For instance, the scale depicted graphically in (1) illustrates the straightforward nonvariable ranking C1 >> C2 >> C3.

(1) Categorical ranking along a continuous scale

[Figure: a horizontal scale running from strict (high ranked) at the left to lax (low ranked) at the right, with points for C1, C2, and C3; C2 and C3 lie close together.]

2.2 How Stochastic Evaluation Generates Variation

The continuous scale becomes more meaningful when differences in distance have observable consequences, e.g. if the short distance between C2 and C3 in (1) tells us that the relative ranking of this constraint pair is less fixed than that of C1 and C2. We suggest that in the process of speaking (i.e. at evaluation time, when the candidates in a tableau have to be evaluated in order to determine a winner), the position of each constraint is temporarily perturbed by a random positive or negative value. In this way, the constraints act as if they are associated with ranges of values, instead of single points. We will call the value used at evaluation time a selection point. The value more permanently associated with the constraint, i.e. the center of the range, will be called the ranking value.

Here there are two main possibilities. If the ranges covered by the selection points do not overlap, the ranking scale again merely recapitulates ordinary categorical ranking:

(2) Categorical ranking with ranges

[Figure: two non-overlapping ranges for C1 and C2 along the strict-lax scale.]

But if the ranges overlap, there will be free (variable) ranking:

(3) Free ranking

[Figure: two overlapping ranges for C2 and C3 along the strict-lax scale.]

The reason is that, at evaluation time, it is possible to choose the selection points from anywhere within the ranges of the two constraints.
In (3), this would most often result in C2 outranking C3, but if the selection points are taken from the upper part of C3's range, and the lower part of C2's, then C3 would outrank C2. The two possibilities are shown in (4), where the marked points indicate the selection points drawn for C2 and C3.
(4) a. Common result: C2 >> C3

[Figure: the overlapping ranges of C2 and C3, with C2's selection point falling on the stricter side of C3's.]

b. Rare result: C3 >> C2

[Figure: the same overlapping ranges, with C3's selection point falling on the stricter side of C2's.]

When one sorts all the constraints in the grammar by their selection points, one obtains a total ranking to be employed for a particular evaluation time. With this total ranking, the ordinary competition of candidates (supplied by the GEN function of Optimality Theory) takes place and determines the winning output candidate.1

The above description covers how the system in (4) behaves at one single evaluation time. Over a longer sequence of evaluations, the overlapping ranges will often yield an important observable effect: for forms in which C2 >> C3 yields a different output than C3 >> C2, one will observe free variation, i.e. multiple outputs for a single underlying form.

To implement these ideas more precisely, we interpret the constraint ranges as probability distributions (Boersma 1997, 1998; Hayes and MacEachern 1998). We assume a function that specifies the probability that a selection point will occur at any given distance above or below the ranking value at evaluation time. By using probability distributions, one can not only enumerate the set of outputs generated by a grammar, but also make predictions about their relative frequencies, a matter that will turn out to be important below.

Many noisy events in the real world occur with probabilities that are appropriately described with a normal (= Gaussian) distribution. A normal distribution has a single peak in the center, which means that values around the center are most probable, and declines gently but swiftly toward zero on each side. Values become less probable the farther they are away from the center, without ever actually becoming zero:

(5) The normal distribution

[Figure: a bell-shaped normal curve centered at its mean μ, with the points μ−σ and μ+σ marked on the horizontal axis.]

A normal distribution is described by its mean μ, which occurs at its center, and its standard deviation σ, which describes the "breadth" of the curve.
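The evaluation procedure described above (perturb each ranking value with normal noise, then sort the constraints by their selection points to obtain a total ranking) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the constraint names and ranking values are invented:

```python
import random

def stochastic_ranking(ranking_values, noise=2.0):
    """Draw a normally distributed selection point for each constraint
    and return the constraint names sorted from strictest to laxest,
    i.e. the total ranking used for one evaluation time."""
    selection_points = {c: v + random.gauss(0.0, noise)
                        for c, v in ranking_values.items()}
    return sorted(selection_points, key=selection_points.get, reverse=True)

# Hypothetical grammar: C1 safely high; C2 and C3 with overlapping ranges.
grammar = {"C1": 100.0, "C2": 90.0, "C3": 88.0}
print(stochastic_ranking(grammar))  # usually ['C1', 'C2', 'C3'],
                                    # occasionally ['C1', 'C3', 'C2']
```

Repeated calls with the same ranking values yield occasionally different total rankings, which is what produces free variation over a sequence of evaluations.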
Approximately 68 percent of the values drawn from a normal distribution lie within one standard deviation of the mean, i.e. between μ−σ and μ+σ. The Gradual Learning Algorithm makes the assumption that selection points for natural language constraints are distributed normally, with the mean of the distribution occurring at the ranking value. The normal distributions are assumed to have the same standard deviation for every constraint, for which we normally adopt the arbitrary value of 2.0.2 In this approach, the behavior of a constraint set depends on its ranking values alone; constraints cannot be individually assigned standard deviations. The process of learning an appropriate constraint ranking therefore consists solely of finding a workable set of ranking values. When discussing the derivation of forms using a set of constraints, we will use the term evaluation noise to designate the standard deviation of the distribution (σ); the term is intended to suggest that this value resides in the evaluation process itself, not in the constraints.

We illustrate these concepts with two hypothetical constraints and their associated normal distributions on an arbitrary scale:

(6) Overlapping ranking distributions

[Figure: two overlapping normal curves on a scale running from 90 (strict) to 80 (lax), with C1 centered at 87.7 and C2 at 83.1.]

In (6), the ranking values for C1 and C2 are at the hypothetical values 87.7 and 83.1. Since the evaluation noise is 2.0, the normal distributions assigned to C1 and C2 overlap substantially. While the selection points for C1 and C2 will most often occur somewhere in the central "hump" of their distributions, they will on occasion be found quite a bit further away. Thus, C1 will outrank C2 at evaluation time in most cases, but the opposite ranking will occasionally hold. Simple calculations show that the percentages for these outcomes will tend towards the values 94.8% (C1 >> C2) and 5.2% (C2 >> C1).

1 The mechanism for determining the winning output in Optimality Theory, with GEN and a ranked constraint set, will not be reviewed here. For background, see Prince and Smolensky's original work (1993), or textbooks such as Archangeli and Langendoen (1997) and Kager (1999).

2.3 How can there not be variation?
A worry that may have presented itself to the reader at this point is: how can this scheme depict obligatory constraint ranking, if the values of the normal distribution never actually reach zero? The answer is that when two constraints have distributions that are dramatically far apart, the odds of a deviant ranking become vanishingly low. Thus, if two distributions are 5 standard deviations apart, the odds that a "reversed" ranking could emerge are about 1 in 5,000. This frequency would be hard to distinguish empirically, we think, from the background noise of speech errors. If the distributions are 9 standard deviations apart, the chances of a "reversed" ranking are 1 in 10 billion, implying that one would not expect to observe a form derived by this ranking even if one monitored a speaker for an entire lifetime. In applying the Gradual Learning Algorithm, we often find that it places constraints at distances of tens or even hundreds of standard deviations apart, giving what is to all intents and purposes non-variable ranking. Often, constraints occur ranked in long transitive chains. The ranking scheme depicted here can treat such cases, since the strictness continuum is assumed to have no upper or lower bounds,

2 Since the units of the ranking scale are themselves arbitrary, it does not matter what standard deviation is used, so long as it is the same for all constraints.
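These figures can be checked directly. Since each selection point is normal with standard deviation equal to the evaluation noise, the difference between two independent selection points is normal with standard deviation √2 times the noise, and the probability of one constraint outranking the other is a cumulative-normal value of that difference. The following sketch (for verification only; it is not part of the learning algorithm itself) reproduces both the 94.8% figure from (6) and the reversal odds just quoted:

```python
import math

def prob_outranks(v1, v2, noise=2.0):
    """Probability that the constraint with ranking value v1 outranks
    the one with ranking value v2 at a single evaluation time.

    The difference of the two selection points is normal with mean
    v1 - v2 and standard deviation noise * sqrt(2); we return the
    probability that this difference is positive, Phi(z)."""
    z = (v1 - v2) / (noise * math.sqrt(2.0))
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

# The hypothetical grammar in (6): ranking values 87.7 and 83.1, noise 2.0.
print(round(prob_outranks(87.7, 83.1), 3))  # 0.948, i.e. 94.8% C1 >> C2

# Reversal odds for distributions 5 and 9 standard deviations apart
# (distances expressed directly in noise units via noise=1.0):
for d in (5.0, 9.0):
    p_reversal = 1.0 - prob_outranks(d, 0.0, noise=1.0)
    print(f"{d} sd apart: about 1 in {1.0 / p_reversal:,.0f}")
```

The computed odds come out near 1 in 5,000 and 1 in 10 billion, matching the values cited in the text.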