Parse decoration of the word sequence in the speech-to-text machine-translation pipeline
Abstract
Parsing, or the extraction of syntactic structure from text, is appealing to natural language processing (NLP) engineers and researchers. Parsing provides an opportunity to consider information about word sequence and relatedness beyond simple adjacency. This dissertation uses automatically-derived syntactic structure (parse decoration) to improve the performance and evaluation of large-scale NLP systems that have (in general) used only word-sequence level measures to quantify success. In particular, this work focuses on parse structure in the context of large-vocabulary automatic speech recognition (ASR) and statistical machine translation (SMT) in English and (in translation) Mandarin Chinese. The research here explores three characteristics of statistical syntactic parsing: dependency structure, constituent structure, and parse uncertainty (making use of the parser's ability to generate an M-best list of parse hypotheses). Parse structure predictions are applied to ASR to improve word-error rate over a baseline non-syntactic (sequence-only) language model (achieving 6–13% of possible error reduction). Critical to this success is the joint reranking of an N×M-best list of N ASR hypothesis transcripts and M-best parse hypotheses (for each transcript). Jointly reranking the N×M lists is also demonstrated to be useful in choosing a high-quality parse from these transcriptions. In SMT, this work demonstrates expected dependency pair match (EDPM), a new mechanism for evaluating the quality of SMT translation hypotheses by comparing them to reference translations. EDPM, which makes direct use of parse dependency structure in its measurement, is demonstrated to correlate better with human measurements of translation quality than the competing (and widely-used) evaluation metrics BLEU4 and translation edit rate.
Finally, this work explores how syntactic constituents may predict or improve the behavior of unsupervised word-aligners, a core component of SMT systems, over a collection of Chinese-English parallel text with reference alignment labels. Statistical word-alignment is improved over several machine-generated alignments by exploiting the coherence of certain parse constituent structures to identify source-language regions where a high-recall aligner may be trusted. Together, these diverse results across ASR and SMT point to the utility of incorporating parse information into large-scale (and generally word-sequence oriented) NLP systems and demonstrate several approaches for doing so.
Copyright 2010
Jeremy G. Kahn
Jeremy G. Kahn
A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
University of Washington
2010
Program Authorized to Offer Degree: Linguistics
Graduate School
This is to certify that I have examined this copy of a doctoral dissertation by
Jeremy G. Kahn
and have found that it is complete and satisfactory in all respects,
and that any and all revisions required by the final
examining committee have been made.
Chair of the Supervisory Committee:
Mari Ostendorf
Reading Committee:
Mari Ostendorf
Paul Aoki
Emily M. Bender
Fei Xia
Date:
degree at the University of Washington, I agree that the Library shall make its copies
freely available for inspection. I further agree that extensive copying of this dissertation is
allowable only for scholarly purposes, consistent with "fair use" as prescribed in the U.S.
Copyright Law. Requests for copying or reproduction of this dissertation may be referred
to Proquest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346,
1-800-521-0600, to whom the author has granted "the right to reproduce and sell (a) copies
of the manuscript in microform and/or (b) printed copies of the manuscript made from
microform."
Signature
Date
Chair of the Supervisory Committee:
Professor Mari Ostendorf
Electrical Engineering & Linguistics
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Evaluating the word sequence
1.2 Using parse information within automatic language processing
1.3 Overview of this work
Chapter 2: Background
2.1 Statistical parsing
2.2 Reranking n-best lists
2.3 Automatic speech recognition
2.4 Statistical machine translation
2.5 Summary
Chapter 3: Parsing Speech
3.1 Background
3.2 Architecture
3.3 Corpus and experimental setup
3.4 Results
3.5 Discussion
Chapter 4: Using grammatical structure to evaluate machine translation
4.1 Background
4.2 Approach: the DPM family of metrics
4.3 Implementation of the DPM family
4.4 Selecting EDPM with human judgements of fluency & adequacy
4.5 Correlating EDPM with HTER
4.6 Combining syntax with edit and semantic knowledge sources
List of Figures
2.1 A lexicalized phrase structure and the corresponding constituent and dependency trees
2.2 The models that contribute to ASR.
2.3 Word alignment between e and f
2.4 The models that make up statistical machine translation systems
3.1 A SParseval example
3.2 System architecture at test time.
3.3 n-best resegmentation using confusion networks
3.4 Oracle parse performance contours for different numbers of parses M and recognition hypotheses N on reference segmentations.
3.5 SParseval performance for different feature and optimization conditions as a function of the size of the N-best list.
4.1 Example dependency trees and their dlh decompositions.
4.2 The dl and lh decompositions of the hypothesis tree in figure 4.1.
4.3 An example headed constituent tree and the labeled dependency tree derived from it.
4.4 Pearson's r for various feature tunings, with 95% confidence intervals. EDPM, BLEU and TER correlations are provided for comparison.
5.1 A Chinese sentence and its translation, with reference alignments and alignments generated by unioned GIZA++
5.2 Examples of the four coherence classes
5.3 Decision trees for VP and IP spans.
5.4 An example incoherent CP-over-IP.
5.5 An example of clause-modifying adverb appearing inside a verb chain
5.6 An example of English ellipsis where Chinese repeats a word.
5.7 Example of an NP-guided union.
List of Tables
1.1 Two ASR hypotheses with the same WER.
1.2 Word-sequences not considered to match by naïve word-sequence evaluation
3.1 Reranker feature descriptions
3.2 Switchboard data partitions
3.3 Segmentation conditions
3.4 Baseline and oracle WER reranking performance from N = 50 word sequence hypotheses and 1-best parse
3.5 Oracle SParseval (WER) reranking performance from N = 50 word sequence hypotheses and M = 1, 10, or 50 parses
3.6 Reranker feature combinations
3.7 WER on the evaluation set for different sentence segmentations and feature sets.
3.8 Word error rate results comparing
3.9 Results under different segmentation conditions when optimizing for SParseval objective
4.1 Per-segment correlation with human fluency/adequacy judgements of different combination methods and decompositions.
4.2 Per-segment correlation with human fluency/adequacy judgements of baselines and different decompositions. N = 1 parses used.
4.3 Considering and N
4.4 Corpus statistics for the GALE 2.5 translation corpus.
4.5 Per-document correlations of EDPM and others to HTER
4.6 Per-sentence, length-weighted correlations of EDPM and others to HTER, by genre and by source language.
5.1 Four mutually exclusive coherence classes for a span s and its projected range s′
5.2 GALE Mandarin-English manually-aligned parallel corpora
5.3 The Mandarin-English parallel corpora used for alignment training
5.4 Alignment error rate, precision, and recall for automatic aligners
5.5 Coherence statistics over the spans delimited by comma classes
5.7 Some reasons for IP incoherence
5.8 Reranking the candidates produced by a committee of aligners.
5.9 Reranking the candidates produced by giza.union.NBEST.
5.10 AER, precision and recall for the bg-precise alignment
5.11 AER, precision and recall over the entire test corpus, using various XP-strategies to determine trusted spans
many of which were mine. Bin and Wei tolerated both my questions about Chinese and
my eagerly verbose explanations of some of the crookeder corners of the English language.
Alex, Brian, Julie and Amittai were always game for engaging in a discussion about tactics
and strategies for natural-language engineering graduate students, and I am pleased to leave
my role as SSLI morale officer in their hands.
Across the road in Padelford, my colleagues and teachers in the Linguistics department
have also been a pleasure. Beyond my committee members named already, I had the
pleasure of guidance and welcome from Julia Herschensohn, the departmental chair, whose
enthusiasm for an interdisciplinary computational linguist like me spared me a number of
administrative ordeals, some of which I'll probably never know about (and I am grateful
to Julia for that). Richard Wright and Alicia Beckford-Wassink were happy to let me be
an "engineering guy" in a room full of empirical linguists. Fellow students Bill, David, and
Scott reminded me from the very beginning that having spent time in industry does not
disqualify one from still studying linguistics. Lesley, Darren, Julia, Amy, and Laurie remind
me whenever I see them (which is often online rather than in person!) that linguistics can
be fun, whichever corner of it you live in.
Over the last two years, I have had the privilege of being hosted at the Speech Technology
and Research (STAR) laboratory at SRI International in Menlo Park, California. I began
my study there as part of the DARPA GALE project, on which SSLI and the STAR lab
collaborated. STAR director Kristin Precoda graciously allowed me to use office space
and come to lab meetings, even after that project ended, while I finished my dissertation.
Dimitra, Wen, Fazil, Jing, Murat, Luciana, Martin, Colleen and Harry, support staff Allan
and Debra, and fellow SSLI alumni Arindam Mandal and Xin Lei also hosted and oriented
me during my time at SRI. All of them have been pleasant hosts and supportive colleagues.
I am doubly grateful that they tolerated my poor attempts at playing Colleen's guitar in
the break room.
I have had fruitful and enjoyable collaborations with students and faculty beyond UW
For the pursuit of a life of love, play, and inquiry;
For my partner, my ally, my friend, my lover;
For what we have already and for what we make together;
For Dorothy.
INTRODUCTION
Parsing, or extracting syntactic structure from text, is an appealing process to lin-
guists studying the grammatical properties of natural language: parsing is an application
of syntactic theory. For non-linguists, including many natural-language engineers, it is not
necessarily of immediate practical use. Engineers and other users of language technology
have generally found word sequences (as in writing) to be a more tractable input and out-
put, and traditional evaluation measures for their tasks have not considered any linguistic
structure beyond the word sequence in their design.
While some natural language applications have embraced parsing at their core (e.g. infor-
mation extraction, which generally begins from parsed sentence structures), this dissertation
applies parsers to two other domains: automatic speech recognition (ASR) and statistical
machine translation (SMT). In evaluation, both of these natural-language processing tasks
traditionally use measurements that evaluate using only matches of words or adjacent sequences
of words (N-grams) against a reference (human-generated) output. In ASR, parsing
features and scores have been explored for improved modeling of word sequences, but these
approaches have not been widely adopted. Similarly, although a few SMT systems use a
parse tree in parts of decoding, parse structures are also not widely adopted in SMT. For
example, statistical word-alignment, a core internal technology for SMT, generally uses no
parse information to hypothesize links between source- and target-language words.
This dissertation explores the incorporation of parsing into representations of language
for natural language processing, particularly for components that have traditionally consid-
ered only the word sequence as input and output. This work takes two related approaches:
exploring new opportunities to bring the information provided by a parser to bear within
the traditional (syntactically-uninformed) approaches to these natural-language tasks, and
Table 1.1: Two ASR hypotheses with the same WER.

            Hypothesis                                                   WER
Reference   People used to arrange their whole schedules around those    -
(a)         people easter arrange their whole schedules around those     0.22
(b)         people used to arrange their whole schedule and those        0.22

Table 1.2: Word-sequences not considered to match by naïve word-sequence evaluation

The man saw the cat.           The cat was seen by the man.
The diplomat left the room.    The diplomat went out of the room.
He quickly left.               He left quickly.
He warmed the soup.            He heated the soup.
optimize                       optimise
don't                          do not
because                        cuz
captures the intuition that some words are more important to the sentence than others.
Table 1.1 considers two hypotheses that the WER metric projects to the same distance
(WER = 0.22). Hypothesis (a) and hypothesis (b) have equal WER, but
(a)'s substitution is on a more central sequence (the main verb used to), while (b)'s word
errors are on a grammatical affix (schedule instead of schedules) and an adjunct adverbial
(around those). One indicator of the centrality of used to is that (b)'s substitution causes
little adjustment to the overall structure of the sentence, whereas (a)'s substitution leaves (a)
with no workable parse structure other than a fragment.
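The equal scores in table 1.1 can be checked with a minimal sketch of WER as word-level edit distance divided by reference length. The function name `word_error_rate` and the lowercasing step are illustrative assumptions, not the scoring tool used in this work.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is lowercased and split on whitespace; real scoring
    tools apply more elaborate normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the reference prefix seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


REFERENCE = "People used to arrange their whole schedules around those"
HYP_A = "people easter arrange their whole schedules around those"
HYP_B = "people used to arrange their whole schedule and those"

print(round(word_error_rate(REFERENCE, HYP_A), 2))  # 0.22
print(round(word_error_rate(REFERENCE, HYP_B), 2))  # 0.22
```

Both hypotheses incur two word errors against a nine-word reference, so the measure cannot distinguish the loss of the main verb in (a) from the more peripheral errors in (b).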
Conversely, table 1.2 presents some example word sequences that a human evaluator
might reasonably consider equivalent (for some evaluation tasks), and which a naïve
word-sequence evaluation would score as different. To capture any of these matches, the evaluation
measure must be able to find a projection of the word sequences such that they may be
found equivalent. The last two pairs in table 1.2 are usually handled by normalization tools,
and spelling variants are sometimes handled by spelling normalization; beyond this, most
evaluations consider only exact matches over sections of the word sequence, and treat all
words as equally important.
Evaluation measures like WER (or extensions using N-grams) use only surface word
identity and word adjacency in their measurements. These measures incorporate neither a
notion of centrality nor argument structure, but individual words' roles in the meaning of a
sentence are determined by their relationship to other, not necessarily adjacent, words. It is
the central contention of this work that extending our measurements and evaluations of the
word sequence to include a deeper representation of linguistic structure provides benefits to
both linguistic and engineering approaches to natural language.
1.2 Using parse information within automatic language processing
The core theme of this work is the use of automatically-derived parse structure to improve
the performance and evaluation of language-processing systems that have generally used
only word-sequence level measures.
Parse decorations on the word sequence can provide benefits to these systems in two ways:
• parse decoration offers a new source of structural information within the models that
go into these systems, providing features from which the models may derive more
powerful hypothesis-choice criteria, and
• parse decoration enables new target measures, for use in system tuning and/or evaluation
of the overall performance of a system.
Both of these techniques are used in this dissertation in ASR and SMT applications. For
ASR systems, this work explores using parse structure for optimization towards both WER
and SParseval (an evaluation measure for parses of speech transcription hypotheses). For
SMT systems, this work explores using parse structure towards providing an evaluation
measure that correlates better with human judgement and towards the optimization of an
internal target (word-alignment).
side of the relatively small domain of treebanks), which is one reason that parse-information
has not been widely adopted into some of these systems. Parser accuracy, especially on gen-
res that do not match the parser's training data, may not be very good. This work adopts
the approach that a parser's own confidence estimates may be used to avoid egregious
blunders, by using expectations (confidence-weighted averages) over parser predictions. A
common thread among the research directions presented here is thus the use of more than
one parse-decoration hypothesis to provide structural information about the word sequence.
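As an illustration of taking an expectation over parser predictions, the sketch below turns an M-best list of parses into confidence-weighted dependency counts. The input format (a log-score paired with a list of (head, dependent) tuples) and the renormalization of scores over the M-best list are assumptions made for this example, not a description of the parser interface used in this work.

```python
import math
from collections import Counter


def expected_dependency_counts(mbest):
    """Confidence-weighted average of dependency counts over an M-best list.

    mbest: list of (log_score, dependencies) pairs, where dependencies is a
    list of hashable items such as (head, dependent) tuples.
    Assumption: posteriors are obtained by renormalizing the parser scores
    over the M-best list only.
    """
    max_log = max(log_score for log_score, _ in mbest)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(log_score - max_log) for log_score, _ in mbest]
    total = sum(weights)
    expected = Counter()
    for weight, (_, dependencies) in zip(weights, mbest):
        for dependency in dependencies:
            expected[dependency] += weight / total
    return expected


# Two hypothetical parses that disagree on a prepositional attachment.
MBEST = [
    (math.log(0.75), [("saw", "man"), ("saw", "with")]),
    (math.log(0.25), [("saw", "man"), ("man", "with")]),
]
EXPECTED = expected_dependency_counts(MBEST)
```

A dependency on which all parses agree keeps an expected count of one, while a contested attachment is discounted by the parser's confidence in it, so a single wrong 1-best parse has less influence on downstream features.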
Previous work on applying grammatical structure to ASR systems has focused on either
parsing a single hypothesis transcript (the parsing task) or on using a single hypothesis parse
to select a transcript (the language-modeling task). By exploring the joint optimization of
parse and transcript hypotheses (chapter 3), this work demonstrates the utility of each to
the other. It frames the parse-decoration as a source of structural features of the hypothe-
ses, to be used in reranking hypotheses. In this approach, WER-optimization is improved
by including information from multiple parse hypotheses, and parse-metric optimization
is improved by comparing multiple parse hypotheses over multiple transcript hypotheses.
Because many NLP tasks explicitly use parsing or chunking, or have verb-dependent
processing, the parse metric is often a better choice for word transcription associated with
NLP tasks.
After considering parsing as an ASR objective, we turn to incorporation of parse dec-
oration towards SMT tasks, beginning by considering SMT evaluation (chapter 4). SMT
evaluation measures have traditionally used only word-sequence information (e.g., measur-
ing the precision of n-grams against a reference translation). This work explores the use
of parsing dependency structure to provide a syntactically-sensitive evaluation measure of
the translation hypotheses. Parse structure, here, is represented as an expectation over
dependency structure (using the multiple-parse hypotheses approach suggested above), and
this work demonstrates that evaluations informed by parse-structure correlate more closely
with human judgements of translation quality than the traditional (word-sequence based)
metrics.
Previous work on applying parsers to SMT has focused mostly on parsing for reordering
explores the use of parsers in improving translation word-alignment (an internal component
of SMT). In this approach, parse-decoration is treated as labels on source-language spans,
and this information is applied to selecting better machine translation word-alignments, an
SMT task that generally uses only word-sequence information. In this work, we explore
the coherence properties of the parse-annotated spans, finding some span-classes that tend
to be coherent, in the sense that a contiguous sequence of source language words is not
broken up in translation. This syntactic coherence is used to guide the combination of a
precision-oriented and recall-oriented automatic alignment.
By exploring the application of parse decoration to word sequences, this work offers several pieces
of evidence for new directions in language-processing work. Word sequences are not always
the best way to evaluate the performance of natural language processing systems; grammatical
structure (from parsing) is in fact a useful source of information to these other
natural-language processing systems, even when used as a component in evaluation (in machine
translation). As part of those results, this work offers new reasons to use and improve
syntactic parsers.
1.3 Overview of this work
The dissertation's structure is as follows: Chapter 2 covers the shared background material:
statistical parsing, and schematic overviews of the operation of ASR and SMT systems.
To accommodate the diversity of corpora and applications, some discussion of background
material and related work is deferred to the appropriate chapter, rather than covering all
background materials in chapter 2. Chapters 3–5 present the prior work, new methods and
experimental results of each of the three applications explored in this thesis.
Chapter 3 applies parsing to automatic speech recognition on English conversational
speech, and shows that information derived from parse structure offers improvements on
WER. In addition, when the ASR/parsing pipeline is directed to target a parse-quality
measure designed for speech transcripts, not only does the pipeline perform better on that
measure but it selects qualitatively different word sequences, reflecting the effect of parse
structure (and its evaluation) on speech recognition.
provide a score $p_{pm}(\cdot \mid w)$ of pronunciation-representation given word $w$; and language
models (LMs, e.g. Stolcke [2002]; see Goodman [2001]) give a score $p_{lm}(w_1, \ldots, w_n)$ of the
word sequence $w_1, \ldots, w_n$. In decoding, all three of the models described above operate
on a relatively small local window: $p_{am}(\cdot)$ uses phone-level contexts, $p_{pm}(\cdot)$ uses the word
in isolation or with its immediate neighbors, and $p_{lm}(\cdot)$ most often uses n-gram Markov
assumptions, computing word sequence likelihoods from only the most-recent $n - 1$ words.
The most typical value for n is three, also known as a "trigram" model, and n rarely exceeds
four or five, due to the computational explosion in storage costs required.
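The n-gram Markov assumption can be illustrated with a minimal sketch of an unsmoothed (maximum-likelihood) n-gram model. The function names and the `<s>`/`</s>` padding symbols are assumptions for this example; production language models add smoothing and backoff, which are omitted here.

```python
from collections import Counter


def train_ngram_counts(sentences, n=3):
    """Count n-grams and their (n-1)-word contexts over tokenized sentences."""
    ngram_counts = Counter()
    context_counts = Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + list(words) + ["</s>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts


def ngram_prob(word, history, ngram_counts, context_counts, n=3):
    """Unsmoothed P(word | history), using only the last n-1 history words."""
    padded = ["<s>"] * (n - 1) + list(history)
    context = tuple(padded[len(padded) - (n - 1):])
    if context_counts[context] == 0:
        return 0.0  # unseen context; a real model would back off instead
    return ngram_counts[context + (word,)] / context_counts[context]


CORPUS = [["the", "cat", "sat"], ["the", "cat", "ran"]]
NGRAMS, CONTEXTS = train_ngram_counts(CORPUS, n=3)
```

Because every distinct n-word sequence needs its own count, the table grows rapidly with n, which is the storage cost referred to above. Words earlier than the last n - 1 have no effect on the prediction, however relevant they are syntactically.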
The rescoring component $F(\cdot, \cdot, w_1, \ldots, w_n)$, by contrast, may use all of the above
scores and also extracts additional features of an utterance- or sentence-length hypothesis
from any of the values mentioned above for use in re-ordering the n-best list. Even without
the feature-extraction $F(\cdot)$, the rescoring component may change the relative weight
of the contribution of the upstream models, but $F(\cdot)$ is often used to extract long-distance
(non-local) features that would be expensive or impossible to extract in the local-context
decoding that the other models provide. An exhaustive survey of prior work using reranking
to capture non-local information in ASR is impractical, but the sorts of long-distance
information exploited include topic information, as in Iyer et al. [1994] or more recently
Naptali et al. [2010], or trigger information [Singh-Miller and Collins, 2007]. These model
long-distance effects from as far away as other sentences (or speakers!) in the same discourse,
not with a syntactic model but with various approaches that cue the activation of
a different vocabulary subset. Another application of reranking operates by adjusting the
output of the generative model to focus on the specific error measure, as in e.g. Roark et al.
[2007]. Further discussion of the use of syntactic information in language-model rescoring
may be found in section 2.3.3.
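A minimal sketch of such a rescoring step follows, assuming a linear combination of per-hypothesis feature values. The feature names ("am", "lm", "parse"), the weights, and the dictionary layout are hypothetical and chosen only to show how a sentence-level feature can change the ranking produced by the upstream models.

```python
def rerank(nbest, weights):
    """Sort hypotheses by a weighted sum of their feature values, best first.

    nbest: list of dicts with a "features" mapping from name to value.
    weights: mapping from feature name to weight; missing names weigh zero.
    """
    def score(hypothesis):
        return sum(weights.get(name, 0.0) * value
                   for name, value in hypothesis["features"].items())
    return sorted(nbest, key=score, reverse=True)


# Hypothetical log-domain scores for two transcript hypotheses.
NBEST = [
    {"words": "people easter arrange their whole schedules",
     "features": {"am": -10.0, "lm": -5.0, "parse": -4.0}},
    {"words": "people used to arrange their whole schedules",
     "features": {"am": -10.5, "lm": -5.5, "parse": -1.0}},
]

LOCAL_ONLY = rerank(NBEST, {"am": 1.0, "lm": 1.0})
WITH_PARSE = rerank(NBEST, {"am": 1.0, "lm": 1.0, "parse": 1.0})
```

With only the local-model scores the first hypothesis wins; adding the sentence-level feature reverses the order. Changing the weights alone, without any new feature, also re-ranks the list, which corresponds to adjusting the relative contribution of the upstream models.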
2.3.2 Evaluation of ASR
Evaluation (and optimization) of speech recognition and its components is carried out
with word error rate (WER), a measure that treats words (or characters) equally, regardless
of their potential impact on a downstream application, as discussed in section 1.1; for
example, function words are given equal weight with content words. One exception is that
filled pauses are, in some evaluations, e.g. GALE [DARPA, 2008], optionally inserted or
deleted without cost when evaluating speech.
A few larger projects that include ASR as a component have suggested extrinsic evalua-
tion methods: in dialog systems, for example, ASR performance is evaluated along with the
other components with a measure of action accuracy (e.g. in Walker et al. [1997] and Lamel
et al. [2000]). In the 2005 Summer Workshop on Parsing Speech [Harper et al., 2005], speech
recognition was evaluated in the extrinsic context of a downstream parser, but only a sin-
gle transcription hypothesis was used. Al-Onaizan and Mangu [2007] explored adjustments
to ASR hypothesis selection in an ASR-to-MT pipeline to allow relatively more insertions
(keeping the WER constant), but found that this made little difference in
automatically-evaluated MT performance.
As an alternative to evaluating ASR with WER or evaluating it directly in the context
of a downstream task, one may instead choose to optimize the ASR towards an improved
form of some intermediate representation (neither the immediate word sequence nor a fully-
extrinsic representation). Hillard et al. [2008], for example, experimented with selecting for
high-SParseval Chinese character-sequences for a downstream Chinese-to-English SMT
system (instead of selecting low character error rate (CER) hypotheses). In follow-up work,
Hillard [2008] found improvement on the automatic SMT measures for unstructured (broad-
cast conversation) genres of speech, though not for structured speech (broadcast news).
Additionally, they found that SParseval measurements of source-language transcription
were better correlated with human assessment of MT performance in the target language
than CER measurements. Intrinsic measures for ASR, however, are almost entirely limited
to WER or its simpler alternative for Chinese, CER.
Chapter 3, which uses parse decoration to rerank ASR transcription hypotheses, evalu-
ates ASR with WER and also with the SParseval parse-quality measure.
ments hypothesized by earlier iterations. The language model $p_{lm}(e)$ does not participate in
this phase of the training: in a bitext, predicting $p_{lm}(e)$ is not helpful; language models are
usually trained separately, using monolingual text. As a byproduct of the parameter-search
to improve these models, the GIZA++ toolkit produces a best alignment linking each word
in e to words in f.
Other tools exist for generating alignments (such as the Berkeley aligner [DeNero and
Klein, 2007]) and there is substantial discussion over how to evaluate and improve the
quality of these alignments. Review of this discussion is passed over here; we will return to
this literature in chapter 5.
Typical independence assumptions in the word-alignment models constrain them to word
sequence and adjacency, applying a penalty for moving words into a different order in
translation. These models for reordering penalties are usually very simple, and do not
incorporate any notion of parse decoration; instead, they assign monotonically-increasing
penalties for moving words in translation. For example, Vogel et al. [1996] uses a hidden
Markov model (HMM, derived from only sequence information) to assign a $p_{rm}(\cdot)$ reordering
model. Language-models in translation are also generally sequence-driven: ASR's basic
n-gram language-modeling approach serves as an excellent baseline to model $p_{lm}(e)$ in MT
work. Early stages in the training bootstrapping sometimes ignore even word sequence
information: GIZA++'s "Model 1" treats $p_{rm}(\cdot)$ as uniform and $p_{tm}(\cdot)$ as independent of
adjacency information (dependent only on the alignment links themselves).
For language-pairs like French-English, where word-order is largely similar, the local-
movement penalties of these simple prm() models usefully constrain the search space of
possible translations to those without large re-ordering: the language- and translation-model
scores will correctly handle any necessary small, local reorderings. For other language-pairs
(e.g., Chinese-English or Arabic-English), though, long-distance re-orderings are necessary,
and these models must assign a small penalty to long-distance movement, which leads to
an explosion in the search space (and a corresponding loss in translation quality).
Having bootstrapped from bitext to word-based alignments, many SMT systems (e.g.
Pharaoh [Koehn et al., 2003] and its open-source successor Moses [Koehn et al., 2007]) take
the bootstrapping further by automatically extracting a "phrase table" from the aligned
operate over hundreds (or thousands!) of sample translations of the same sentence.
The two most popular automatic metrics are BLEU [Papineni et al., 2002], a measure of
n-gram precision, and the TER [Snover et al., 2006] edit distance. BLEU remains the most
widely-reported measure of translation quality against a reference translation (or set of
reference translations). BLEU is a geometric mean of precisions over varying n-gram lengths:
\[
\mathrm{BLEU}_n(h, r) = \sqrt[n]{\prod_{i=1}^{n} \pi_i(h, r)} \cdot \mathrm{BP}(h, r) \tag{2.3}
\]
where $\pi_i(h, r)$ reflects the precision of the i-grams in hypothesis h with respect to reference
r, and the term BP(h, r) is a "brevity penalty" to discourage the production of extremely
short (low-recall, high-precision) translations:
\[
\mathrm{BP}(h, r) =
\begin{cases}
\exp\left(1 - |r| / |h|\right) & \text{if } |h| < |r| \\
1 & \text{if } |h| \ge |r|
\end{cases}
\]
Most results are reported with BLEU4.
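The definition in equation 2.3 can be sketched in a few lines of Python. This is a minimal illustration for a single reference, not the reference BLEU implementation: the function names are hypothetical, the clipping of matched n-gram counts follows the usual BLEU convention (the text above does not spell it out), and no smoothing is applied.

```python
import math
from collections import Counter


def ngrams(words, n):
    """Return a Counter of the n-grams (as tuples) in a list of words."""
    return Counter(tuple(words[k:k + n]) for k in range(len(words) - n + 1))


def precision(hyp, ref, i):
    """Clipped i-gram precision of hyp against a single reference (assumed convention)."""
    hyp_counts, ref_counts = ngrams(hyp, i), ngrams(ref, i)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matched / total


def brevity_penalty(hyp, ref):
    """exp(1 - |r|/|h|) when the hypothesis is shorter than the reference, else 1."""
    if len(hyp) == 0:
        return 0.0
    return math.exp(1 - len(ref) / len(hyp)) if len(hyp) < len(ref) else 1.0


def bleu(hyp, ref, n=4):
    """Geometric mean of the 1..n-gram precisions, scaled by the brevity penalty."""
    precisions = [precision(hyp, ref, i) for i in range(1, n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed: any zero precision zeroes the product
    log_mean = sum(math.log(p) for p in precisions) / n
    return math.exp(log_mean) * brevity_penalty(hyp, ref)
```

Working in log space keeps the n-th root numerically simple; a hypothesis with no matching 4-gram scores zero under this unsmoothed sketch, which is one reason sentence-level BLEU is often smoothed in practice.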
Translation Edit Rate (TER) is an error measure like WER, which measures the oper-
ations required to transform hypothesis h into reference r:
\[
\mathrm{TER}(h, r) = \frac{\mathrm{insertions}(h, r) + \mathrm{deletions}(h, r) + \mathrm{substitutions}(h, r) + \mathrm{shifts}(h, r)}{\mathrm{length}(r)} \tag{2.4}
\]
where insertions, deletions and substitutions count one per word, while shift operations
move any adjacent sequence of words from one position in h to another. Insertion, deletion,
substitution and shift error counts are calculated through an alignment between reference
and hypothesis that heuristically minimizes the total number of operations needed.
When working with multiple references, BLEU4 is defined so that its n-grams may match
those in any of the references, allowing translation variability across the multiple references,
but TER's approach to multiple references is just to return the minimum edit ratio over
the set of references, which is less forgiving to the candidate translation.
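As a rough illustration of equation 2.4 and of the minimum-over-references rule, the sketch below computes an edit ratio without the shift operation. Omitting shifts is a deliberate simplification (real TER searches for shifts heuristically), so this reduces to a WER-style ratio; the function names are hypothetical and references are assumed to be non-empty.

```python
def edit_distance(hyp, ref):
    """Minimum insertions + deletions + substitutions turning hyp into ref."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,              # delete h
                            curr[j - 1] + 1,          # insert r
                            prev[j - 1] + (h != r)))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def shiftless_ter(hyp, refs):
    """Minimum edit ratio over a set of non-empty references (shifts not modelled)."""
    return min(edit_distance(hyp, r) / len(r) for r in refs)
```

Because shifts can only reduce the number of operations needed, this ratio is never lower than an edit rate that also allows shifts.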
Like word error rate for ASR, the BLEU and TER metrics use no syntactic or argument-
structure modeling to determine which words matter more: all words are treated equally.
In TER, substituting or shifting a single word incurs the same cost regardless of where the
substitution or shift happens; in BLEU, all hypothesis n-grams contribute equally to the
score of the sentence. Because of the emphasis on these automatic measures, innovations in
MT have often focused on the innovations' effects on these measures directly, sometimes to
the point of reporting only on one of these entirely automatic measures.
Some have raised skepticism towards the focus on the BLEU and TER automatic mea-
sures on theoretical [Callison-Burch, 2006] and empirical [Charniak et al., 2003] grounds, in
that they do not always accurately track translation quality as judged by a human annota-
tor, and they may not even reliably separate professional from machine translations [Culy
and Riehemann, 2003]. Other automatic MT measures have been proposed, some of which
use parse decorations. Chapter 4 describes some of these alternatives in more detail.
An ideal automatic measure would correlate well with human judgements of translation
quality. However, judgements of fluency and adequacy themselves are highly variable across
annotators. Rather than correlate with these measurements, one may instead examine the
correlation with a different human-derived measure of translation quality: Snover et al.
[2006] propose Human-targeted Translation Edit Rate (HTER), a measurement of the work
performed by a human editor to correct the translation until it is equivalent to the reference
translation. They show that HTER is very well-correlated with fluency/adequacy
judgements and has lower variance: a single HTER score is more predictive of
a held-out fluency/adequacy judgement than a single fluency/adequacy judgement is. HTER
still requires human intervention, but, probably because of its consistency in evaluation,
it has been adopted as the evaluation standard for the DARPA GALE project [DARPA,
2008].
2.4.3 Parsing in MT
Early applications of syntactic structure to SMT were explored as an
alternative to the phrase-table approach. Yamada and Knight [2001] and Gildea [2003]
incorporate operations on a treebank-trained target-language parse tree to represent $p(f \mid a, e)$
and $p(a \mid e)$, but have no "phrase" component; Charniak et al. [2003] apply grammatical
structure to the p(e) language-model component. These approaches met with only moderate
success.
Rather than building a syntactic model into the decoder or language model, others pro-
posed automatically [Xia and McCord, 2004, Costa-jussa and Fonollosa, 2006] and manually
coded [Collins et al., 2005a, Popovic and Ney, 2006] transformations on source-language
trees, to reorder source sentences from $f$ to $f'$ before training or decoding (translation
models are trained on bitexts with $f'$ and $e$). Zhang et al. [2007] extend this approach
by inserting an explicit source-to-source "pre-reordering" model $p_{r'}(f' \mid f)$ to provide lattice
input alternatives to the main translation.
The phrase-table models described in section 2.4.1 capture some local syntactic structure
(even when the phrases are simply reliably-adjacent word sequences) by virtue
of recording actually-observed n-grams in the source- and target-language sequences, but
these models offer additional power when they are made syntactically aware. Syntactically-aware
decoders are combined with the phrase-table approach in the ISI
systems [Galley et al., 2004, 2006, Marcu et al., 2006], the systems built by Zollmann et al.
[2007], and recently the Joshua open-source project [Li et al., 2009]. Each of these builds
syntactic trees over the target side of the bitext in training and learns phrase-table entries
with syntactically-labeled spans. Conversely, Quirk et al. [2005] and Xiong et al. [2007]
construct phrase-table entries using source-language dependency structure, while Liu et al.
[2006a] apply a similar technique using constituent structure instead of dependency.
Rather than pursue these phrase-table based decoder models directly, chapter 5 of this
work explores mechanisms to use parsers to improve the word-to-word alignments that are
the material from which the phrases are learned.
2.5 Summary
This chapter has provided an overview of four key technologies for the remainder of this
work: statistical parsing, n-best list reranking, automatic speech recognition, and statistical
machine translation. Special attention is paid to the interaction of parsers with speech
recognition, the evaluation of speech recognition and machine translation, and the existing
roles of syntactic structure in statistical machine translation. The next three chapters
use parsers (and rerankers) in various combinations on conversational speech recognition
(chapter 3), machine translation evaluation (chapter 4), and on improving word alignment
quality for machine translation (chapter 5). Further details on related work more directly
related to this thesis are provided in each chapter.
Chapter 3
PARSING SPEECH
Parse-decoration on the word sequence has a strong potential for application in the
domain of automatic speech recognition (ASR). Extracting syntactic structure from speech
is more challenging than ASR or parsing alone, because the combination of these two stages
introduces the potential for cascading error, and most parsing systems assume that the leaves
(words) of the syntactic tree are fixed. This chapter¹ applies parse structure as an additional
knowledge source, even when the evaluation targets do not include parse structure explicitly.
It also considers the benefits to parsing of considering alternative speech transcripts (when
the evaluation targets are parse measures themselves).
We thus consider recognition and parsing as a joint reranking problem, with uncertainty
(in the form of multiple hypotheses) in both the recognizer and parser components. In this
joint problem, there are two possible targets: word sequence quality, measured by word
error rate (WER), and parse quality, measured over speech transcripts by SParseval. For
both these targets, sentence boundary concerns have largely been ignored in prior work:
speech recognition research has generally assumed that sentence boundaries do not have
a major impact, since the placement of segment boundaries in a string does not affect
WER on that string. Parsing research, on the other hand, has generally assumed that
sentence boundaries are given (usually by punctuation), since most parsing research has
been on text. Spoken language, unlike written language, does not have explicit markers for
sentence and paragraph breaks; i.e., punctuation is not verbalized. Sentence boundaries in
spoken corpora must therefore be automatically recognized, introducing another source of
difficulty for the joint recognition-and-parsing problem, regardless of the target: sentence
segmentation.
¹ The work presented in this chapter is included in a paper that has been accepted to Computer Speech
and Language.
Although there has been a substantial amount of research on speech recognition, seg-
mentation of spoken language, and parsing (as described in the next section), there has
been little work exploring automation of all three together. Most research has incorporated
only one or two of these areas, typically treating recognition and parsing as separable pro-
cesses. In this chapter, we combine recognition and parsing using discriminative reranking:
selecting optimal word sequences from the N-best word sequences generated from a speech
recognizer given cues from M parses for each, and selecting optimal parse structure from the
N × M-best parse structures associated with these word sequences. At the same time, we
explore the impact of automatic segmentation. We ask the following inter-related questions:
• In the task of extracting parse structure from conversational speech, how much can
  we improve performance by exploiting the uncertainty of the speech recognizer?
• In the word recognition task, does a discriminative syntactic language model benefit
  from incorporating parse uncertainty in parse feature extraction?
• How does segmentation affect the usefulness of parse information for improving speech
  recognition, and what is its impact on parsing accuracy, given alternative word
  sequences and alternative parse hypotheses?
Section 3.1 discusses the relevant background for this research integrating speech segmen-
tation, parsing, and speech recognition. Section 3.2 outlines the experimental framework in
which this chapter explores those questions, while section 3.3 describes the corpus and the
configuration of the various components of this system. Section 3.4 describes the results of
those experiments, and section 3.5 discusses these results in the context of the dissertation
as a whole.
3.1 Background
Our approach to parsing conversational speech builds on several active research areas in
speech and natural language processing. This section extends the review from chapter 2 to
highlight the prior work most related to the work in this chapter.
3.1.1 Parsing on speech and its evaluation
As discussed in section 2.1.4, most parsing research has been developed with the parseval
metric [Black et al., 1991], which was initially developed for parse measurement on text.
It was used in initial studies of speech based on reference transcripts (without considering
speech recognizer errors). The grammatical structures of speech are different from those of
text: for example, Charniak and Johnson [2001] demonstrated the usefulness (as measured
by parseval) of explicit modeling of edit regions in parsing transcripts of conversational
speech.
Unfortunately, parseval is not well-suited to evaluating parses of automatically-recognized
speech. In particular, when the words (leaves) are different between reference and hypothesized
trees (as will be the case when there are recognition errors), it is difficult to say
whether a particular span is included in both, and the parseval measure is not well defined.
Roark et al. [2006] introduce alternative scoring methods to address this problem
with SParseval, a parse evaluation toolkit. The SParseval method used here takes into
account dependency relationships among words instead of spans. Specifically, CFG trees
are converted into dependency trees using a head-finding algorithm and head percolation of
the words at the leaves. Each dependency tree is treated as a bag of triples $\langle d, r, h \rangle$ where
$d$ is the dependent word, $r$ is a symbol describing the relation, and $h$ is the dominating
lexical headword (central content word in the phrase). Arc-labels $r$ are determined from
the highest constituent label in the dependent and the lowest constituent label dominating
the dependent and the head. SParseval describes the overlap between the "gold" and
hypothesized bags-of-triples in terms of precision, recall and F measure.
Overall, SParseval allows a principled incorporation of both word accuracy and accuracy
of parse relationships. Since every triple (the dependency-pair and its link label, as
in figure 3.1) involves two words, this measure depends heavily on word accuracy, but in a
more complex way than word error rate, the standard speech recognition evaluation metric.
Figure 3.1 demonstrates a number of properties of the SParseval measure. Although
both (b) and (c) have the same word error (one substitution each), they have very different
precision and recall behavior. As the figure suggests, the SParseval measure over-weights
(a) Reference tree S/think over "I really think so":
      (I, S/NP, think)
      (really, VP/AdvP, think)
      (think, <s>/S, <s>)
      (so, VP/AdvP, think)
(b) Hypothesized tree S/think over "I really think yeah":
    Precision = 3/4, Recall = 3/4, Word Error Rate = 1/4
    * (I, S/NP, think)
    * (really, VP/AdvP, think)
    * (think, <s>/S, <s>)
      (yeah, S/DM, think)
(c) Hypothesized tree S/sink over "I really sink so":
    Precision = 0/4, Recall = 0/4, Word Error Rate = 1/4
      (I, S/NP, sink)
      (really, VP/AdvP, sink)
      (sink, <s>/S, <s>)
      (so, VP/AdvP, sink)
Figure 3.1: A SParseval example that includes a reference tree (a) and two hypothesized
trees (b,c) with alternative word sequences. Each tree lists the dependency triples that
it contains; triples marked * in the hypothesized trees indicate triples that overlap with the
reference tree. Although all have the same parse structure, tree (c) is penalized more
heavily (no triples right) because it gets the head word think wrong.
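The bag-of-triples comparison behind Figure 3.1 can be sketched as follows. This is an illustration only, not the SParseval toolkit itself: the function name is hypothetical, the triples are typed in by hand from the figure, and the F measure is assumed to weight precision and recall equally.

```python
from collections import Counter


def triple_prf(gold, hyp):
    """Precision, recall and F over two bags of (dependent, relation, head) triples."""
    gold_bag, hyp_bag = Counter(gold), Counter(hyp)
    overlap = sum((gold_bag & hyp_bag).values())  # multiset intersection
    precision = overlap / sum(hyp_bag.values()) if hyp_bag else 0.0
    recall = overlap / sum(gold_bag.values()) if gold_bag else 0.0
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f


reference = [("I", "S/NP", "think"), ("really", "VP/AdvP", "think"),
             ("think", "<s>/S", "<s>"), ("so", "VP/AdvP", "think")]
hyp_b = [("I", "S/NP", "think"), ("really", "VP/AdvP", "think"),
         ("think", "<s>/S", "<s>"), ("yeah", "S/DM", "think")]
hyp_c = [("I", "S/NP", "sink"), ("really", "VP/AdvP", "sink"),
         ("sink", "<s>/S", "<s>"), ("so", "VP/AdvP", "sink")]

print(triple_prf(reference, hyp_b))  # (0.75, 0.75, 0.75)
print(triple_prf(reference, hyp_c))  # (0.0, 0.0, 0.0)
```

Both hypotheses contain a single substituted word, yet replacing the head word changes every triple, which is the behavior the figure caption describes.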
assume that segment-boundary conditions are matched between LM training and test: Stol-
cke [1997] demonstrated that adjusting pause-based ASR n-best lists to take into account
segment boundaries matched to language model training data gave reductions in word error
rate.
3.1.3 Parse features in reranking
Section 2.3.3 discussed general approaches to using parsing as a language model, including
parsing language-models like Chelba and Jelinek [2000] and Roark [2001]. Reranking, as
discussed in section 2.2, is applied to parsers [Collins and Koo, 2005] but also to language-
modeling for ASR, with [e.g., Collins et al., 2005b] and without [Roark et al., 2007] parse
features.
Collins et al. [2005b] perform discriminative reranking using features of the parse structure
extracted from a single-best parse of the English ASR hypothesis. Arisoy et al. [2010] used
a similar strategy for Turkish language modeling. In both cases, the objective was the
minimization of WER. Harper et al. [2005] and others, as mentioned above, use reranking
with the parsing objective over automatic speech transcripts. However, neither the
syntactic language-modeling work nor the parsing work using automatic speech
transcripts considers the variable hypotheses of both the speech recognizer and the parser in
a reranking context. Using both variables together is the approach pursued in this chapter.
3.2 Architecture
The system for handling conversational speech presented in this chapter is illustrated
schematically in figure 3.2 and involves the following steps:
1. a speech recognizer, which generates speech recognition lattices with associated
   probabilities from an audio segment (here, a conversation side);
2. a segmenter which detects sentence-like segment boundaries $E$, given the top word
   hypothesis from the recognizer and prosodic features from the audio;
3. a resegmenter which applies the segment boundaries $E$ to confusion networks derived
   from the lattices and generates an N-best word hypothesis cohort $W^s$ for each
   segment $s$, made up of word sequences $w_i$ with associated recognizer posteriors $p_w(w_i)$
   for each of the N sequences $w_i \in W^s$;
4. a parser component which generates an M-best list of parses $t_{i,j}$, $j = 1, \ldots, M$,
   for each $w_i \in W^s$, along with confidences $p_p(t_{i,j}, w_i)$ for each parse over each word
   sequence (all the $t_{i,j}$ for a given segment $s$ make up the parse cohort $T^s$);
5. a feature extractor which extracts a vector of descriptive features $f_{i,j}$ over each
   member of the parse structure cohort, which together make up the feature cohort $F^s$;
   and
6. a reranker component which selects an optimal vector of features (and thus a preferred
   candidate) from the cohort and effectively chooses an optimal $\langle w, t \rangle$, which
Figure 3.2: System architecture at test time.
of a sequence of word slots where each slot contains a list of word sequence hypotheses
with associated posterior probabilities [Mangu et al., 2000]. Because the slots are linearly
ordered, they can be cut and rejoined at any inter-slot boundary. All the confusion net-
works for a single conversation side are concatenated. Speaker diarization (the relationship
between this conversation side and the transcription of the interlocutor) is not varied. The
concatenated confusion network is then cut at locations corresponding to the hypothesized
segment boundaries, producing a segmented confusion network. Each candidate segmentation
produces a different re-cut confusion network.
These re-cut confusion networks are used to generate $W^s$, an N-best list of transcription
hypotheses, for each hypothesized segment $s$ from the target segmentation. Each transcription
$w_i$ of $W^s$ has a recognizer confidence $p_r(w_i)$, calculated as
\[
p_r(w_i) = \prod_{k=1}^{\mathrm{len}(w_i)} p_r(w_{ik}) \tag{3.1}
\]
where $p_r(w_{ik})$ is the confusion network confidence of the word selected for $w_i$ from the k-th
slot in the confusion net. This posterior probability $p_r(w_{ik})$ is derived from the recognizer's
forward-backward decoding where the acoustic model, language model, and posterior scaling
weights are tuned to minimize WER on a development set.
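The re-cutting of a confusion network and the confidence of equation 3.1 can be sketched as below. The data layout (one dictionary of word posteriors per slot, with an empty string for a deletion entry), the toy posteriors, and the function names are all assumptions made for illustration; they are not the recognizer's actual data format.

```python
import math

# Hypothetical confusion network: one dict per slot, mapping word -> posterior.
# "" stands for an empty (deleted) slot entry.
network = [
    {"i": 0.9, "eye": 0.1},
    {"really": 0.8, "": 0.2},
    {"think": 0.6, "sink": 0.4},
    {"so": 0.7, "": 0.3},
]


def cut(network, boundaries):
    """Split a linear confusion network at the given inter-slot boundary indices."""
    edges = [0] + sorted(boundaries) + [len(network)]
    return [network[a:b] for a, b in zip(edges, edges[1:])]


def log_confidence(segment, words):
    """Log of equation 3.1: sum of log posteriors of the word chosen in each slot."""
    return sum(math.log(slot[w]) for slot, w in zip(segment, words))
```

Summing log posteriors rather than multiplying raw probabilities avoids underflow on long segments and matches the logarithmic form in which probabilities are later handed to the reranker.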
3.2.2 Feature extraction
After creating the parse cohort $T^s$ from the word-sequence cohort $W^s$, each member of the
parse cohort is a word-sequence hypothesis $w_i$ with a parse tree $t_{i,j}$ projected over it, along
with two confidences: the ASR system posterior $p_r(w_i)$ and the parse posterior $p_p(t_{i,j}, w_i)$.
The feature-extraction step (step 5) extracts additional features and organizes all of these
features into a vector $f_{i,j}$ to pass to the reranker. The feature extraction is organized to
allow us to vary $f_{i,j}$ to include different subsets of those extracted.
In this subsection, we present three classes of features extracted from our joint recognizer-parser
architecture: per-word-sequence features, generated directly from the output of
the recognizer and resegmenter and shared by all parse candidates associated with a transcription
hypothesis $w_i$; per-parse features, generated from the output of the parser,
which are different for each parse hypothesis $t_{i,j}$; and aggregated-parse features, constructed
from the parse candidates but which aggregate across all $t_{i,j}$ that belong to the
same $w_i$. The features are listed in Table 3.1. All of the probability features $p(\cdot)$ are
presented to the reranker in logarithmic form (values $-\infty$ to 0).

Table 3.1: Reranker feature descriptions for parse $t_{i,j}$ of word sequence $w_i$
Feature               Description                                Feature class
$p_r(w_i)$            Recognizer probability                     per-word-sequence features
$C_i$                 Word count of $w_i$                        per-word-sequence features
$B_i$                 True if $w_i$ is empty                     per-word-sequence features
$p_p(t_{i,j}, w_i)$   Parse probability                          per-parse features
$\psi(t_{i,j})$       Non-local syntactic features               per-parse features
$p_{plm}(w_i)$        Parser language model                      aggregated parse features
$E[\psi_i]$           Non-local syntactic feature expectations   aggregated parse features
Per-word-sequence features
Two recognizer outputs are read directly from the N-best lists produced in step 3 and
reflect non-parse information. The first is the recognizer language-model score $p_r(w_i)$, which
is calculated from the resegmenter's confusion networks as described in equation 3.1. A
second recognizer feature is the number of words $C_i$ in the word hypothesis, which allows
the reranker to explicitly model sequence length. Lastly, an empty-hypothesis indicator $B_i$
(where $B_i = 1$ when $C_i = 0$) allows the reranker to learn a score to counterbalance the lack
of a useful parse score. (It is possible that a segment will have some hypothesized word
sequences $w_i$ that have valid words and some that contain only noise, silence or laughter,
i.e., an empty hypothesis, which would have no meaningful parse.)
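The three per-word-sequence features are simple enough to show directly. The sketch below assumes a hypothesis is available as a list of words plus its recognizer posterior; the function and key names are hypothetical.

```python
import math


def per_word_sequence_features(words, recognizer_posterior):
    """Features shared by every parse of one word-sequence hypothesis w_i."""
    count = len(words)
    return {
        "log_p_r": math.log(recognizer_posterior),  # log recognizer confidence
        "C": count,                                 # word count C_i
        "B": 1 if count == 0 else 0,                # empty-hypothesis indicator B_i
    }
```

Keeping the indicator separate from the count lets a linear reranker assign empty hypotheses their own offset instead of extrapolating from the length weight.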
Table 3.2: Switchboard data partitions
Partition Sides Words
Train 1042 654271
Dev 116 76189
Eval 128 58494
described in Kahn [2005], which are summarized here. Various aspects of the syntactic
annotation beyond the scope of this task (for example, empty categories) were removed. The
parses were also resegmented to match the SU segments, with some additional rule-based
changes performed to make these annotations more closely match the LDC SU conventions.
In the resegmented trees, constituents spanning manually-annotated segment boundaries
were discarded, and multiple trees within a single manually annotated segment were
subsumed beneath a top-level SUGROUP constituent. To match the speech recognizer output,
punctuation is removed, and contractions are retokenized (e.g., can + n't → can't).
The corpus was partitioned into training, development and evaluation sets whose sizes
are shown in Table 3.2. Results are reported on the evaluation set; the development set was
used during debugging and for exploring new feature-sets for f , but no results from it are
reported here.
3.3.1 Evaluation measures
Word recognition performance is evaluated using word-error rate measurements generated
by the NIST sclite scoring tool [NIST, 2005] with the words in the reference parses taken
as the reference transcription. Because we want to compare performance across different
segmentations, WER is calculated on a per-conversation side basis, concatenating all the
top-ranked word sequence hypotheses in a given conversation side together. When comparing
the statistical significance of different results between configurations, the Wilcoxon
Signed Rank test provided by sclite is used.
For parse-quality evaluation, we use the SParseval toolkit [Roark et al., 2006], again
calculated on a per-conversation side basis, concatenating all the top-ranked parse hypotheses
in a given conversation. We use the setting that invokes Charniak's implementation
of the head-finding algorithm and consider performance over both closed- and open-class
words. When comparing the statistical significance of SParseval results, we use a
per-segment randomization [Yeh, 2000].
3.3.2 Component configurations
Speech recognizer
The recognizer is the SRI Decipher conversational speech recognition system [Stolcke et al.,
2006], a state-of-the-art large-vocabulary speech recognizer that uses various acoustic and
language models to perform multiple recognition and adaptation passes. The full system has
multiple front-ends, each of which produces n-best lists containing up to 2000 word sequence
hypotheses per audio segment, which are then combined into a single set of word sequence
hypotheses using a confusion network. This system has a WER of 18.6% on the standard
NIST RT-04 evaluation test set.
Human-annotated reference parses are required for all the data involved in these experiments.
Unfortunately, because they are difficult to create, reference parses are in short
supply, and all the Switchboard conversations used in the evaluation of this system are
already part of the training data for the SRI recognizer. Although it represents only a very
small part of the training data (Switchboard is only a small part of the corpus, and the
data here are restricted to the hand-parsed fraction of Switchboard), there is the danger
that this will lead to unrealistically good recognizer performance. This work compensates
for this potential danger by using a less powerful version of the full recognizer, which has
fewer stages of rescoring and adaptation than the full system and a WER of 20.2% on the
RT-04 test set. On our evaluation set from Switchboard, this system has a 22.9% WER.
Segmenter
Our automatic segmenter [Liu et al., 2006b] frames the sentence-segmentation problem as a
binary classification problem in which each boundary between words can be labeled as either
Table 3.3: Segmentation conditions. F and SER report the SU boundary performance over
the evaluation section of the corpus.
Segmentation condition   Threshold   F        SER      # Segments (Train)   # Segments (Eval)   Average length
Pause-based              NA          0.62     0.61     54943                5693                10.3
Min-SER                  0.5         0.77     0.45     86681                8417                6.9
Over-segmented           0.35        0.78     0.46     96627                9369                6.2
Reference                NA          (1.00)   (0.00)   91254                8779                6.7
Resegmenter
Given the confusion network representation of the speech recognition output, the main
task of resegmentation is generating N -best lists given a new segmentation condition for
the confusion networks. For a given segment, the lattice-tool program from the SRI
Language Modeling Toolkit [Stolcke, 2002] is used to find paths through the confusion
network ranked in order of probability, so the N most probable paths are emitted as an
N-best list $\langle w_1, \ldots, w_N \rangle$, where each $w_i$ is a sequence of words. For these experiments, the
N-best lists are limited to at most N = 50 word sequence hypotheses.
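To make the N-best extraction concrete, the sketch below enumerates the most probable paths through a linear confusion network. It is not how lattice-tool is implemented; it is a small stand-in that relies on the slots being independent, and the slot representation (a dictionary of word posteriors, with an empty string for a deletion) and the function name are assumptions.

```python
import heapq
import math


def n_best(segment, n):
    """Return the n most probable paths through a linear confusion network.

    Slots are independent, so keeping only the n best partial paths after
    each slot is exact for this structure.
    """
    beams = [(0.0, ())]  # (log probability, words chosen so far)
    for slot in segment:
        extended = [(lp + math.log(p), words + (w,))
                    for lp, words in beams for w, p in slot.items()]
        beams = heapq.nlargest(n, extended, key=lambda item: item[0])
    # Drop empty-slot markers to obtain word sequences (duplicates are not merged).
    return [(lp, [w for w in words if w]) for lp, words in beams]
```

Pruning to n partial paths per slot loses nothing here: any prefix of a top-n complete path must itself be among the top-n prefixes, because every prefix can be completed by the same suffix.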
Parser
Our system uses an updated release of the Charniak generative parser [Charniak, 2001] (the
first stage of the November 2009 updated release of [Charniak and Johnson, 2005], without
the discriminative second-stage component) to do the M-best parse-list (and parse-score)
generation. As in Kahn [2005], we do not implement a separate "edit detection" stage
but treat edits as part of the syntactic structure. The parser is trained on the entire
training set's reference parses; no parse trees from other sources are included in the training
set. We generate M = 10 parses for each word sequence hypothesis, based on analyses
(presented later) that showed little benefit from additional parses and much more benefit
from increasing the number of sentence hypotheses. If the parser generates fewer than M
hypotheses, we take as many as are available. For the full system, we train a single parser
on the entire training set; for providing training cohorts to the reranker, the parser is trained
on round-robin subsets of the training set, as discussed in section 3.3.2.
Feature extractor
The extraction of non-local syntactic features $\psi(t_{i,j})$ uses the software and feature definitions
from Charniak and Johnson [2005]. For tractability, we prune the set of features to those
with non-zero (and non-uniform) values within a single segment's hypothesis set for more
than 2000 segments, which is approximately 2% of the total number of training segments
(as in the parse-reranking experiments in Kahn et al. [2005]). Pruning is done separately
for each segmentation of the training set, yielding about 40,000 non-local syntactic features
under most segmentation conditions.²
The aggregate parse features $p_{plm}(w_i)$ and $E[\psi_i]$ are calculated by sums across the M
parses generated for each $w_i$. We assume that this approximation (instructing the parser to
return no parses after the M-th) has no important impact on the value of these features.
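One plausible reading of these sums is sketched below: the parser language-model score is the sum of the joint parse probabilities over the M-best list, and each feature expectation weights a parse's feature value by its share of that sum. The normalisation over the M-best list, the function names, and the dictionary representation of feature vectors are assumptions; the text only states that the features are sums across the M parses.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def aggregate_parse_features(log_joint, feature_vectors):
    """Aggregate the M parses of one word sequence w_i.

    log_joint[j]        log p_p(t_ij, w_i) for parse j
    feature_vectors[j]  dict of non-local feature name -> value for parse j
    """
    log_plm = logsumexp(log_joint)  # log of sum_j p_p(t_ij, w_i)
    expectations = {}
    for lj, feats in zip(log_joint, feature_vectors):
        weight = math.exp(lj - log_plm)  # parse posterior, normalised over the M-best list
        for name, value in feats.items():
            expectations[name] = expectations.get(name, 0.0) + weight * value
    return log_plm, expectations
```

Truncating at M parses drops the probability mass of lower-ranked parses, which is the approximation the paragraph above assumes to be harmless.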
Reranker
As discussed in section 2.2, the reranker component of our system is the svm-rank tool from
Joachims [2006]. The reranker needs to be trained using candidate parses from a data set
that is independent of the parser training and the evaluation test set. Because of the limited
amount of hand-annotated parse tree data, we did not want to create a separate training
partition just for this model. Instead, we adopt the round-robin procedure described in
Collins and Koo [2005]: we build 10 leave-n-out parser models, each trained on 9/10 of the
training set, and run each on the tenth that it has not been exposed to. The resulting parse
candidate sets are passed to the feature-extraction component and the resulting vectors
(and their objective function values) are used to train the reranker models.
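The round-robin procedure can be sketched as a standard k-fold loop. The assignment of conversation sides to folds by index, and the `train_parser` and `parse_m_best` callables, are hypothetical placeholders standing in for the real parser training and M-best parsing steps.

```python
def round_robin_folds(sides, k=10):
    """Yield (train, held_out) splits in which each item is held out exactly once."""
    for fold in range(k):
        held_out = [s for idx, s in enumerate(sides) if idx % k == fold]
        train = [s for idx, s in enumerate(sides) if idx % k != fold]
        yield train, held_out


def reranker_training_cohorts(sides, train_parser, parse_m_best, k=10):
    """Parse every side with a parser that never saw it during training."""
    cohorts = []
    for train, held_out in round_robin_folds(sides, k):
        parser = train_parser(train)  # hypothetical: returns a trained parser
        cohorts.extend(parse_m_best(parser, side) for side in held_out)
    return cohorts
```

The point of the rotation is that the reranker is trained on parser output of the same quality it will see at test time, rather than on parses of sentences the parser has memorised.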
² Our non-local syntactic feature set is thus slightly different for each segmentation, since the number
and content of the set of segments vary among segmentations. The pause-based segmentation, with
substantially longer segments, selects about 28,000 features under this pruning condition; others have
about 40,000.
To avoid memory constraints, we assign each segment to one of 10 separate bins and
train 10 svm-rank models.³ For each experimental combination of segmentation and features
in $f_{i,j}$, we re-train all 10 rerankers. At evaluation time, the cohort candidates are ranked
by all 10 models and their scores are averaged. The parse (or word-sequence) of the top-
ranked candidate is taken to be the system's hypothesis for a given segment, and evaluated
according to either the WER or SParseval objective.
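The evaluation-time combination of the 10 models amounts to averaging scores and taking the best candidate. In the sketch below each model is represented as a plain callable over a feature dictionary; this is an assumption for illustration and does not reflect the svm-rank tool's own file formats or interfaces.

```python
def linear_model(weights):
    """Hypothetical stand-in for one trained ranking model: a weighted feature sum."""
    return lambda feats: sum(weights.get(name, 0.0) * v for name, v in feats.items())


def rerank(candidates, models):
    """Return the candidate whose score, averaged over all models, is highest."""
    def mean_score(features):
        return sum(model(features) for model in models) / len(models)
    return max(candidates, key=lambda c: mean_score(c["features"]))
```

For linear models, averaging the scores is equivalent to scoring once with the averaged weight vector, so the ensemble adds no cost beyond training.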
3.4 Results
This section describes the results of experiments designed to assess the potential for
performance improvement associated with increasing the number of word-sequence vs. parse
candidates, as well as the actual gains achieved by reranking under both WER and SParseval
objectives and different segmentation conditions. We also include a qualitative analysis
of improvements.
3.4.1 Baseline and Oracle Results
To provide a baseline, we sequentially apply the recognizer, segmenter, and parser, choosing
the top scoring word-sequence and then the top parse choice. We establish upper bounds
for each objective by selecting the candidate from the M × N parse-and-word-sequence
cohort that scores the best on each objective function. The results of these experiments are
reported in tables 3.4 (optimizing for WER with M = 1 and different N) and 3.5 (optimizing
for SParseval with N = 50 and different M). The number in parentheses corresponds to
the mismatched condition: picking a candidate based on one criterion and scoring it with
another. Both sets of results show that improving one objective leads to improvements in
the other, since word errors are incorporated into the SParseval score.
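The two reference points can be sketched over a single cohort as follows. The candidate fields (`words`, `log_p_r`, `log_p_p`) and the `error` callable are hypothetical names chosen for this illustration.

```python
def baseline(cohort):
    """Serial pipeline: best recognizer hypothesis first, then its best parse."""
    top_words = max(cohort, key=lambda c: c["log_p_r"])["words"]
    same_words = [c for c in cohort if c["words"] == top_words]
    return max(same_words, key=lambda c: c["log_p_p"])


def oracle(cohort, error):
    """Upper bound: the candidate with the lowest error on the chosen objective."""
    return min(cohort, key=error)
```

Scoring the oracle choice for one objective under the other objective gives the mismatched condition described above.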
Table 3.4 shows that the N -best cohorts contain a potential WER error reduction of
32%. Larger gains are possible for the shorter-segment segmentation conditions, due to the
increase in the number of available alternatives when generating N -best lists from more
³ Each candidate set is generated by a single leave-n-out parser (populated by conversation-side), but each
svm-rank bin (populated by segments, not by conversation sides) includes some cohorts from each of the
leave-n-out tenths.
Table 3.6: Reranker feature combinations. In addition, all feature sets contain the
per-word-sequence features p_r(w_i), C_i and B_i.

Feature set        Additional features               Per
ASR                (no additional features)          word sequence
ParseP             p_p(t_{i,j}, w_i)                 parse
ParseLM            p_plm(w_i)                        word sequence
ParseP+NLSF        p_p(t_{i,j}, w_i), (t_{i,j})      parse
ParseLM+E[NLSF]    p_plm(w_i), E[ i]                 word sequence
3.4.2 Optimizing for WER
We also investigate whether providing multiple M-best parses to the reranker augments
the parsing knowledge source when optimizing for WER (compared to using only one parse
annotation, or to using no parse annotation at all). To examine this, we explore different
alternatives for creating the feature-vector representation f_{i,j} of a word-sequence candidate,
as summarized in Table 3.6. All experiments include recognizer confidences p_w(w_i), word
count C_i, empty-hypothesis flag B_i, and parser posteriors p_p(t_{i,j}, w_i) in the feature vector.
Table 3.6 shows all the feature combinations investigated with the feature names used here.
Table 3.7 shows the WER results of all the segmentation conditions and feature sets,
which can be compared to the baseline serial result of 23.7%. Reranking with the ASR
features alone does not improve performance, since there is little that the reranker can
learn (acoustic and language model scores are combined in the process of generating N-best
lists from confusion networks). The WER performance is worse than baseline on the Min-SER
and Ref segmentations, possibly because these segments are relatively longer than those of the
Over-seg condition, making word length differences a less useful feature. Other results in
table 3.7 confirm that non-local syntactic features (t_{i,j}) (NLSF here) are useful for word
recognition, confirming the results from Collins et al. [2005b]. In addition, there are some
new findings. First, SU segmentation impacts the utility of the parser for word transcription
(as well as for parsing). There is no benefit to using the parse probabilities alone except
Table 3.9: Results under different segmentation conditions when optimizing for SParseval
objective; the associated WER results are reported in parentheses.

                 Segmentation
Features         Pause-based    Min-SER        Over-seg       Ref seg
Baseline         (23.7) 68.2    (23.7) 70.7    (23.7) 70.9    (23.7) 72.5
ParseP           (24.1) 68.8    (24.0) 71.1    (24.0) 71.3    (23.2) 73.4
ParseP+NLSF      (24.3) 69.1    (25.5) 70.4    (25.8) 70.4    (23.5) 73.1
oracle           (20.3) 74.4    (19.7) 78.0    (19.3) 78.5    (18.3) 82.3
parse probability alone, and in some cases hurt performance, which seems to contradict
prior results in parse reranking. However, as shown in Figure 3.5, there is an improvement
due to use of features for the case where there is only N = 1 recognition hypothesis, but
that improvement is small compared to gains from increasing N. Figure 3.5 also shows
that optimizing for WER with non-local syntactic features actually leads to better parsing
performance than when optimizing directly for parse performance. We conjecture that this
result is due to overtraining the reranker when the feature dimensionality is high and the
training samples are biased to have many poorly-scoring candidates. The parsing problem
involves many more candidates to rank than WER (300 vs. 30 on average) because
parse-reranking has M × N candidates while transcription-reranking has at most N candidates.
Since the pool of M × N is much larger, it contains more poorly-ranking candidates and
thus the learning may be dominated by the many pairwise cases involving poor-quality
candidates.
3.4.4 Qualitative observations
We examined the recognizer outputs for the WER optimization with the ParseLM and
expected NLSF features to understand the types of improvements resulting from using a
parse-based language-model for re-ranking. Under this WER optimization on reference
segmentation, of the 8,726 segments in the test set, 985 had WER improvements and 462
she was there [they're] like all winter semester
they're [there] going to school
we're [where] the old folks now
(Contraction corrections like these are not included in the count for short main verb
corrections.) Further improvements are found in complementizers and prepositions (about 5%
each), while only about 10% of the improvements changed content words. The remaining
45% of improvements are miscellaneous.
Another pronoun example illustrates how the parse features can overcome the bias of
frequent n-grams in conversational speech:
Improved: they *** really get uh into it
Baseline: yeah yeah really get uh into it
Reference: they uh really get uh into it
with substitution errors in italics and deletions indicated by "***". (The bigram "yeah
yeah" is very frequent in the Switchboard corpus.)
Of the segments that suffered WER degradation under ParseLM+E[NLSF] WER
optimization, a little more than 15% were errors on a word involved in a repetition or
self-correction, e.g. the omission of the boldface the in:
. . . that's not the not the way that the society is going
Another 7-10% of these candidates that had WER degradation were more grammatically
plausible than the reference transcription, e.g. the substitution of a determiner a for an
unusually-placed pronoun (probably a correction):
Reference: but i lot of times i don't remember the names
Optimized: but a lot of times i do not remember the names
Most importantly, these last two classes of WER degradation do not have an impact on the
meaning of the sentence. The remaining roughly 75% of the WER-degraded segments are
difficult to characterize, but a large majority of them also involve function words.
Any syntactic-dependency-oriented measure requires a system for proposing dependency
structure over the reference and hypothesis translations. Liu and Gildea [2005] use a PCFG
parser with deterministic head-finding, while Owczarzak et al. [2007a] extract the semantic
dependency relations from an LFG parser [Cahill et al., 2004]. This chapter's work
extends the dependency-scoring strategies of Owczarzak et al. [2007a], which reported
substantial improvement in correlation with human judgement relative to BLEU and TER,
by using a publicly-available probabilistic context-free grammar (PCFG) parser and
deterministic head-finding rules, rather than an LFG parser. In addition, this chapter considers
alternative syntactic decompositions and alternative mechanisms for computing score
combinations. Finally, the work presented here explores combination of syntax with synonym-
and paraphrase-matching scoring metrics.
Evaluation of automatic MT measures requires correlation with MT evaluation measures
performed by human beings. Some work [Banerjee and Lavie, 2005, Liu and Gildea, 2005,
Owczarzak et al., 2007a] compares the measure to human judgements of fluency and
adequacy. Other work [e.g. Snover et al., 2006] compares measures' correlation with
human-targeted TER (HTER), an edit distance to a human-revised reference. The metrics
developed here are evaluated in terms of their correlation against both fluency/adequacy
judgements and against HTER scores.
4.2 Approach: the DPM family of metrics
The specific family of dependency pair match (DPM) measures described here combines
precision and recall scores of various decompositions of a syntactic dependency tree. Rather
than comparing string sequences, as BLEU does with its n-gram precision, this approach
defers to a parser for an indication of the relevant word tuples associated with meaning; in
these implementations, the head on which that word depends. Each sentence (both reference
and hypothesis) is converted to a labeled syntactic dependency tree and then relations from
each tree are extracted and compared. These measures may be seen as generalizations of
the earlier paper is cited in comparisons. Section 4.6 includes synonym matching, but over data which
are not directly comparable with either Owczarzak paper and using an entirely different mechanism for
combination.
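Figure 4.3: An example headed constituent tree and the labeled dependency tree derived
from it.
Headed constituent tree: (root/stumbled (s/stumbled (np/cat (dt/the the) (nn/cat cat)) (vp/stumbled (vbd/stumbled stumbled))))
Dependency tree over "The cat stumbled <root>": The -> cat (np/dt); cat -> stumbled (s/np); stumbled -> <root> (root/s)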
labelled-dependency SParseval [Roark et al., 2006], and may be considered as a shallow
approximation of the rich semantics generated by LFG parsers [Cahill et al., 2004]. The
A/B labels are not as descriptive as the LFG semantics, but they have a similar resolution
in English (with its relatively fixed word order), e.g. the s/np arc label usually represents
a subject dependent of a sentential verb.
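As an illustration of how relations might be extracted from such a labeled dependency tree, the sketch below decomposes the arcs of the "The cat stumbled" example into tuple bags. It assumes that dl pairs a dependent with its arc label, lh pairs an arc label with the head, dlh keeps all three, and dh drops the label; the `decompose` helper and the `<root>` token spelling are hypothetical.

```python
from collections import Counter

# Arcs of the example, as (dependent, label, head).
arcs = [
    ("the", "np/dt", "cat"),
    ("cat", "s/np", "stumbled"),
    ("stumbled", "root/s", "<root>"),
]

def decompose(arcs, kind):
    """Count the tuples of one decomposition type over a dependency tree."""
    pick = {
        "dl": lambda dep, lab, head: (dep, lab),
        "lh": lambda dep, lab, head: (lab, head),
        "dlh": lambda dep, lab, head: (dep, lab, head),
        "dh": lambda dep, lab, head: (dep, head),
    }[kind]
    return Counter(pick(dep, lab, head) for dep, lab, head in arcs)

dl = decompose(arcs, "dl")
lh = decompose(arcs, "lh")
```

For this sentence, `dl` contains `("cat", "s/np")` (the word "cat" acting as a subject-like dependent) and `lh` contains `("s/np", "stumbled")` (some subject-like dependent attached to "stumbled").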
For the cases where we have N-best parse hypotheses, we use the associated parse
probabilities (or confidences) to compute expected counts. The sentence will then be represented
with more tuples, corresponding to alternative analyses. For example, if the N-best parses
include two different roles for dependent "Basra", then two different dl tuples are included,
each with the weighted count that is the sum of the confidences of all parses having the
respective role.⁴
The parse confidence \tilde{p} is normalized so that the N-best confidences sum to one. Because
the parser is overconfident, we explore a flattened estimate:
\[ \tilde{p}(k) = \frac{p(k)^{\gamma}}{\sum_i p(i)^{\gamma}}, \]
where k, i index the parse and \gamma is a free parameter.
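A minimal sketch of the flattened estimate and the resulting expected tuple counts follows. The function names are hypothetical, and the parse probabilities and arc labels in the "Basra" example are invented for illustration; only the exponent-and-renormalize form comes from the text.

```python
from collections import Counter

def flatten(probs, gamma):
    """Raise each parse probability to gamma, then renormalize to sum to one.

    gamma = 0 gives a uniform distribution; gamma = 1 only renormalizes.
    """
    powered = [p ** gamma for p in probs]
    total = sum(powered)
    return [x / total for x in powered]

def expected_counts(nbest, gamma):
    """nbest: list of (parse_probability, tuples) pairs, one per parse."""
    weights = flatten([p for p, _ in nbest], gamma)
    counts = Counter()
    for weight, (_, tuples) in zip(weights, nbest):
        for t in tuples:
            counts[t] += weight  # each tuple is weighted by its parse's confidence
    return counts

# Invented 3-best list: "Basra" takes two different roles across parses.
nbest = [
    (0.6, [("Basra", "np/nnp")]),
    (0.3, [("Basra", "pp/np")]),
    (0.1, [("Basra", "np/nnp")]),
]
unflattened = expected_counts(nbest, gamma=1.0)
uniform = expected_counts(nbest, gamma=0.0)
```

With gamma = 1 the two dl tuples receive expected counts of 0.7 and 0.3; with gamma = 0 every parse is weighted equally, giving 2/3 and 1/3.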
4 The use of expectations with N-best parses is different from d_50 and d_50_pm in Owczarzak et al.
[2007a], in that the latter uses the best-matching pair of trees rather than an aggregate over the tree sets
and they do not use parse confidences.
Table 4.1: Per-segment correlation with human fluency/adequacy judgements of different
combination methods and decompositions.

metric                 r
BLEU4                  0.218
F[1g, 2g, dl, lh]      0.237
PR[1g, 2g, dl, lh]     0.217
F[1g, 2g]              0.227
PR[1g, 2g]             0.215
F[1g, dl, dlh]         0.227
F[dl, lh]              0.226
PR[dl, lh]             0.208
Parse confidence. The distribution-flattening parameter γ is varied from γ = 0 (uniform
distribution) to γ = 1 (no flattening).
Score combination. Global F vs. component harmonic mean PR.
4.4.1 Choosing a combination method: F vs. PR
In table 4.1, we compare combination methods for a variety of decompositions. These
results demonstrate that F consistently outperforms PR as well as the BLEU4 baseline
(see table 4.2). PR measures are never better than BLEU; PR combinations are thus not
considered further in this work.
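The difference between the two combination methods can be sketched as follows, under stated assumptions: F is taken here to pool the tuples of all decompositions before computing a single precision, recall, and F-measure, while PR is taken to be the harmonic mean of every per-decomposition precision and recall. The helper names and the toy tuple bags are hypothetical.

```python
from collections import Counter
from statistics import harmonic_mean

def precision_recall(hyp, ref):
    """Precision and recall of one tuple bag against another (clipped counts)."""
    match = sum(min(count, ref[t]) for t, count in hyp.items())
    return match / sum(hyp.values()), match / sum(ref.values())

def global_f(hyps, refs):
    """F: pool the tuples of all decompositions, then take one F-measure."""
    pooled_h, pooled_r = Counter(), Counter()
    for kind in hyps:
        pooled_h.update({(kind, t): c for t, c in hyps[kind].items()})
        pooled_r.update({(kind, t): c for t, c in refs[kind].items()})
    return harmonic_mean(precision_recall(pooled_h, pooled_r))

def component_pr(hyps, refs):
    """PR: harmonic mean of every per-decomposition precision and recall."""
    values = []
    for kind in hyps:
        values.extend(precision_recall(hyps[kind], refs[kind]))
    return harmonic_mean(values)

# Toy bags: the hypothesis gets 2 of 3 unigrams and 1 of 3 dl tuples right.
refs = {
    "1g": Counter(["the", "cat", "stumbled"]),
    "dl": Counter([("the", "np/dt"), ("cat", "s/np"), ("stumbled", "root/s")]),
}
hyps = {
    "1g": Counter(["the", "cat", "fell"]),
    "dl": Counter([("the", "np/dt"), ("cat", "vp/np"), ("fell", "root/s")]),
}
f_score = global_f(hyps, refs)
pr_score = component_pr(hyps, refs)
```

On this toy pair the pooled F is 0.5, while the component-wise PR is 4/9: the harmonic mean over components is pulled down more strongly by the weakest decomposition.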
4.4.2 Choosing a set of decompositions
Considering only the 1-best parse, we compare DPM with different decompositions to the
baseline measures. Table 4.2 shows that all decompositions except [dlh] have a better
per-segment correlation with the fluency/adequacy scores than TER or BLEU4.
Including progressively larger chunks of the dependency graph with F[1g, dl, dlh], inspired by the
edit costs.
The experiments here use the TERp optimizer but extend the set of subscores by including
the syntactic and n-gram overlap features (modified to reflect false and missed detection
rates for the TERp format rather than precision and recall). The subscores explored include:
E: the 8 fully syntactic subscores from the DPM family, including false/miss error rates
for the expected values of dl, lh, dlh, and dh decompositions.
N: the 4 n-gram subscores from the DPM family; specifically, error rates for the 1g and
2g decompositions.
T: the 11 subscores from TERp, which include matches, insertions, deletions,
substitutions, shifts, synonym and stem matches, and four paraphrase edit scores.
For these experiments, we again use the GALE 2.5 data, but with 2-fold cross-validation
in order to have independent tuning and test data. Documents are partitioned randomly,
such that each subset has the same document distribution across source-language and genre.
As in section 4.5.2, the objective is length-normalized per-sentence correlation with HTER,
using mean-removed scores as before. In figure 4.4, we plot the Pearson's r (with 95%
confidence interval) for the results on the two test sets combined, after linearly normalizing
the predicted scores to account for magnitude differences in the learned weight vectors. The
baseline scores, which involve no tuning, are not normalized.
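The need for normalization before pooling the two test sets can be illustrated with a small sketch. It assumes, purely for illustration, that the linear normalization is a per-fold standardization (zero mean, unit variance); the text does not specify the exact transform, and all scores below are invented.

```python
from math import sqrt
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def standardize(scores):
    """Assumed linear normalization: shift and scale one fold's predictions."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

# Two folds whose learned weight vectors differ in magnitude by a factor of 10.
fold_a_pred, fold_a_hter = [1.0, 2.0, 3.0], [0.1, 0.2, 0.3]
fold_b_pred, fold_b_hter = [10.0, 20.0, 30.0], [0.1, 0.2, 0.3]

hter = fold_a_hter + fold_b_hter
r_raw = pearson_r(fold_a_pred + fold_b_pred, hter)
r_norm = pearson_r(standardize(fold_a_pred) + standardize(fold_b_pred), hter)
```

Each fold's predictions correlate perfectly with HTER on their own, but pooling the raw predictions gives a much lower r because of the scale mismatch; after per-fold normalization the pooled correlation is restored.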
The left side of figure 4.4 shows that TER and EDPM are significantly more correlated
with HTER than BLEU when measured on this dataset, which is consistent with the overall
results of the previous section. It is also worth noting that the N+E combination is not
equivalent to EDPM (though it has the same decompositions of the syntactic tree);
EDPM's combination strategy yields a more robust r correlation with HTER. The N+E
combination outperforms E alone (i.e. it is helpful to use both n-gram and dependency
overlap) but gives lower performance than EDPM because of the particular combination
technique. Both findings are consistent with the fluency/adequacy experiments in
section 4.4. The TERp features (T in figure 4.4), which account for synonym/paraphrase
differences, have much higher correlation with HTER than the syntactic E+N subscores.
However, a significant additional improvement is obtained by adding syntactic features to
TERp (T+E). Adding the n-gram features to TERp (T+N) gives almost as much
improvement, probably because most dependencies are local. There is no further gain from using
all three subscore types.
Figure 4.4: Pearson's r for various feature tunings, with 95% confidence intervals. EDPM,
BLEU and TER correlations are provided for comparison.
4.7 Discussion
In summary, this chapter introduces the DPM family of dependency pair match measures.
Through a corpus of human fluency and adequacy judgements, we select EDPM, a member
of that family with promising predictive power. We find that EDPM is superior to BLEU4
and TER in terms of correlation with human fluency/adequacy judgements and as a
per-document and per-sentence predictor of mean-normalized HTER. We also experiment with
including syntactic (EDPM-style) features and synonym/paraphrase features in a TERp-style
linear combination, and find that the combination improves correlation with HTER