Parallel Multi-Theory Annotations of Syntactic Structure

by Jerid Francom, Mans Hulden
Proceedings of the 6th International Conference on Language Resources and Evaluation (2008)

Parallel multi-theory annotation of syntactic structure
Jerid Francom and Mans Hulden
The University of Arizona
Tucson, AZ 85721
{jeridf,mhulden}@email.arizona.edu
Abstract
We present an approach to creating a treebank of sentences using multiple notations or linguistic theories simultaneously. We illustrate
the method by annotating sentences from the Penn Treebank II in three different theories in parallel: the original PTB notation, a
Functional Dependency Grammar notation, and a Government and Binding style notation. Sentences annotated with all of these theories
are represented in XML as a directed acyclic graph where nodes and edges may carry extra information depending on the theory encoded.
1. Introduction
Following the establishment of various treebanks for an
assortment of languages, the past few years have been
marked by strong interest in the development of parallel
treebanks—that is, treebanks that simultaneously annotate
sentences in two or more languages. For machine transla-
tion purposes in particular, this effort seems to prove quite
valuable. In contrast to such multi-language annotation
work, far less attention has been given to multi-theory anno-
tations: annotating treebanks in parallel following different
linguistic theories.
The current work highlights limitations in current treebank
formulations that potentially restrict the accessibility and
overall usefulness of treebanks and suggests that such lim-
itations can be reduced by incorporating syntactic annota-
tion for a set of sentences in multiple theories. Furthermore,
a multi-theory treebank creates a host of novel opportuni-
ties that cannot be addressed under current approaches to
corpus annotation. Yet the practical problems surrounding
such multi-theory treebanks are quite different in nature
from those seen in multi-language treebanks. In what
follows, we present the arguments supporting the creation
of a multi-theory treebank, review the properties to
consider in annotating sentences in a variety of theories
simultaneously, provide a set of examples in various
theoretical frameworks, and discuss the difficulties one
faces in the development of such a multi-purpose treebank.
2. Treebanks and linguistic theory
Treebanks such as the Penn Treebank (Marcus et al., 1994)
have become a rich source of data applicable to research
in theoretical, corpus and computational linguistics. The
syntactic annotation supplied in these sources allows
researchers to extract sub-sentential information from a
range of sentence types for a wide variety of applications.
However, the current set of treebanks is not without
shortcomings. First, very few treebanks in wide use today
overlap with respect to the collection of sentences that
have been annotated.[1] The data used in a given treebank
may not be optimal for all types of research due to the
nature of the source.[2] This, in turn, hinders the
potential of any given treebank to serve as a
general-purpose resource. Also, many treebanks either
assume a theory-neutral approach to annotation or adopt
some very specific theoretical framework. In some cases,
the lack of detailed theory-specific information has been
overcome by adding more refined annotations to existing
generic treebanks (Honnibal and Curran, 2005; Hockenmaier
and Steedman, 2007). Also, particular theoretical questions
may not be easily pursued with existing treebanks due to
the specific needs of the researcher, despite the
attractiveness of the data on other grounds (e.g., text
source, language, size). Taken together, the overall
usefulness of treebanks may not be fully realized under the
current formulation.
Our proposal aims to fill this apparent gap and create
novel research opportunities by elaborating an encoding
scheme for a ‘parallel multi-theory’ treebank, that is, a
treebank which simultaneously annotates sentences in two
or more theoretical frameworks. With the popularity of
multi-language treebanks and the growth of more reliable
parsing methods that require less manual intervention than
before, the possibility of annotating a single collection of
sentences using different theories is becoming feasible with
much less effort than in the past.
A multi-theoretical approach to treebanks shows promise
in overcoming those aspects of current treebanks that
restrict their overall convenience as a tool for research.
This approach creates an alternative to the theory-neutral
vs. theory-specific issue of annotation while providing the
ability to extrapolate information over the same data set
regardless of the researcher’s theoretical flavor of
interest.[3] Furthermore, given explicit correspondences
between theoretical machinery, a multi-theory collection
also provides a host of other potential applications that
benefit from the opportunity to ask questions about the
data in one theory relative to another theory, or another
version of the same theory, that are not feasible without
the elaboration of this type of data set.
[1] The larger richly annotated treebanks available for
English, such as the Penn treebank (Bies et al., 1995) and
LinGO Redwoods (Oepen et al., 2004), all annotate different
sets of sentences.
[2] Treebanks vary on a variety of properties that may
render a particular dataset less than optimal, including
the text source (newspaper, web, conversation, etc.),
language, and relative size.
[3] Overall comparability is enhanced by maintaining the
same set of sentences across the different theories,
whereas treebanks that work with different sentences are
not as directly comparable.
[NP [Det The] [N company]]
<NP>
<Det><word="The"></Det>
<N><word="company"></N>
</NP>
Figure 1: A potential simple XML approach that allows
only phrase structure modeling.
Advances in parallel treebank construction demonstrate the
possibility of parallel annotation of a single collection of
data (Samuelsson and Volk, 2005). However, the encod-
ing of a collection of the sort proposed here poses prac-
tical questions about how to maximize its usefulness that
are somewhat distinct from those found in multi-language
treebanks. At their most basic level both a multi-language
and multi-theory treebank must encode sub-sentential units
and the relative correspondences between each of these in
a parallel dataset. Yet in addition to this level of descrip-
tion and the challenges posed therein, a multi-theory tree-
bank must also create an exchange format that is capable
of encoding more detailed properties of a linguistic theory
in general and the nuances of particular theories while at
the same time providing the substrate for further theoretical
augmentation. A more detailed description of the properties
to model and the encoding schema employed in this approach
follows.
3. Properties to model
In encoding our exchange format in XML for simultaneous
annotation of sentences in multiple theories we have tried
to exploit the following common ground between most
theories: trees, dependency structures, collections of
features and values, etc., can all be considered elements
of directed acyclic graphs (DAGs). A standard
phrase-structure tree, as seen in figure 1, could for the
most part be encoded directly into XML by embedding one
structure type inside another. However, by doing so, one
would lose the ability to handle crossing constituents,
dependency grammars, and the like. As noted by Mengel and
Lezius (2000), this can be remedied simply by using the XML
format to specify a number of nodes and words, which may be
linked to each other, thus allowing one to represent
crossing dependencies.
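As a rough illustration of why a flat list of words plus separately declared links can express what nested elements cannot, the sketch below stores links as pairs of word ids and tests whether two links cross. The word ids, the link list and the helper names `span` and `links_cross` are invented for this example and are not part of the encoding described in this paper.

```python
from itertools import combinations

# Words are declared once, in surface order; the ids are illustrative only.
words = ["w0", "w1", "w2", "w3"]
position = {word_id: index for index, word_id in enumerate(words)}

# Links are declared separately from the words, as (head, dependent) pairs,
# so nothing forces them to nest the way embedded XML elements must.
links = [("w0", "w2"), ("w1", "w3")]


def span(link):
    """Return the (left, right) surface positions covered by a link."""
    a, b = position[link[0]], position[link[1]]
    return (min(a, b), max(a, b))


def links_cross(first, second):
    """True if the two links overlap without one containing the other."""
    (a, b), (c, d) = sorted([span(first), span(second)])
    return a < c < b < d


crossing = [pair for pair in combinations(links, 2) if links_cross(*pair)]
print(crossing)
```

A strictly nested encoding would have no way to write down the two links above, since neither span contains the other.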
In addition to simply regarding a theoretical framework as a
collection of nodes and edges, one would also like to retain
the hierarchical structure imposed by formats such as XML
in order to take advantage of the myriad of tools available
for handling XML-encoded data. Also, in order to be as
flexible as possible, a multi-theory annotation should share
as many generic theoretical devices as possible, while al-
lowing for differences only when absolutely required.
As an exploratory framework, we have chosen three different
theories to encode simultaneously: the standard Penn
Treebank II format (Bies et al., 1995) (from which our
sentences were also taken), functional dependency grammar
(FDG) (Tapanainen and Järvinen, 1997), and the generative
Government and Binding theory (where we assume Haegeman
(1994) to be the standard GB reference).
<s id="s1">
<words>
<word id="w1_0" orth="The"/>
<word id="w1_1" orth="company"/>
<word id="w1_2" orth="can"/>
<word id="w1_3" orth="go"/>
<word id="w1_4" orth="about"/>
<word id="w1_5" orth="its"/>
<word id="w1_6" orth="business"/>
</words>
<theories>
...
</theories>
</s>
Figure 2: The basic XML encoding for individual words.
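The word section of figure 2 is ordinary XML, so it can be read with any standard parser. The following is a minimal sketch using Python's standard library; the variable names are ours, and the elided `<theories>` section is replaced by an empty element so that the fragment stands alone.

```python
import xml.etree.ElementTree as ET

# The <words> section copied from figure 2; the theory sections are omitted.
SENTENCE_XML = """
<s id="s1">
  <words>
    <word id="w1_0" orth="The"/>
    <word id="w1_1" orth="company"/>
    <word id="w1_2" orth="can"/>
    <word id="w1_3" orth="go"/>
    <word id="w1_4" orth="about"/>
    <word id="w1_5" orth="its"/>
    <word id="w1_6" orth="business"/>
  </words>
  <theories/>
</s>
"""

root = ET.fromstring(SENTENCE_XML.strip())

# Map each word id to its orthographic form, preserving document order.
orth_by_id = {word.get("id"): word.get("orth") for word in root.find("words")}

sentence = " ".join(orth_by_id.values())
print(sentence)
```

The resulting id-to-form mapping is what theory-specific nodes would later point into.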
Together these representations span a variety of theoreti-
cal machinery that needs to be captured in an annotation.
While the PTB structure is almost purely phrase-structure
based, GB theory requires the modeling of a number of
non-phrase-structure elements: movement and traces, co-
indexation, theta-roles (through which subcategorization
information is expressed), and feature-value combinations,
such as those relating to case, person, number, etc.
Functional dependency grammar, in turn, requires that words
be able to depend on each other in various ways: dependency
grammars generally classify dependencies between words into
different types, whereas phrase structure grammars have
only a single type of edge between nodes (representing a
parent-child relation).
Despite this plurality, we have made the following simplify-
ing assumptions to bring different models as close together
as possible:
• Phrase structure trees and dependency structures are
DAGs with nodes possibly having edges to the ortho-
graphic representation of a string of words
• Nodes and edges may carry a set of properties that
are theory-specific (although, for FDG, the only edge-
related property used is that of “label”).
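One way to picture these two assumptions is as a small in-memory data model. The sketch below is not the authors' implementation: the class and field names (`Node`, `Edge`, `word_refs`, `is_acyclic`) are hypothetical, and the acyclicity check relies on `graphlib`, which requires Python 3.9 or later. The sample values are taken from the FDG node shown in figure 4.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class Edge:
    target: str                                      # id of the node pointed to
    properties: dict = field(default_factory=dict)   # e.g. {"label": "det"}


@dataclass
class Node:
    node_id: str
    word_refs: list = field(default_factory=list)    # ids of orthographic words
    edges: list = field(default_factory=list)        # outgoing edges
    properties: dict = field(default_factory=dict)   # theory-specific properties


def is_acyclic(nodes):
    """Return True if the node-to-node edges form a directed acyclic graph."""
    graph = {n.node_id: {e.target for e in n.edges} for n in nodes}
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return False
    return True


# The FDG node for "The" from figure 4, plus a bare node for "company".
the = Node("n1_0", ["w1_0"], [Edge("n1_1", {"label": "det"})],
           {"function": "@DN", "pos": "N DET"})
company = Node("n1_1", ["w1_1"])

print(is_acyclic([the, company]))
```

Because properties are plain key-value pairs on both nodes and edges, the same two classes could hold a GB node with tense and index properties or a PTB node with only a label.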
3.1. Example annotations
In our model, every sentence shares the same set of
words—i.e. the orthographic representation, which can be
seen as a preamble to the more theory-specific annotations
that follow. For instance, the sentence from the PTB The
company can go about its business (see figure 6 for the PTB
representation), would entail an encoding as in figure 2.
The identifiers of the words simply serve as possible tar-
gets for nodes in the more theory-specific section that
follows the <words> section. This orthographic sec-
tion is followed by a specific section for each theory in-
cluded in the encoding. The following <theory> ...
</theory> section is largely identical in format for
all theories as regards nodes and edges. However, each
<node> section contains a theory-specific section defin-
ing the properties of a particular node as illustrated in the
following examples.
In figure 3, we see a GB tree of the above example sentence
with the XML representation of node I illustrating a) the
various “properties” the node carries (tense, person,
number, co-indexation, labeling), b) an edge from the node
to the orthographic index, and c) how the phrase-structure
tree is encoded by directed edges.[4]
Similarly, figure 4 illustrates the corresponding
FDG-annotated graph, where the node associated with the
first word ‘The’ is seen in XML.[5] Although GB and FDG are
superficially quite different theories, the only
differences between this FDG encoding and the GB one are
that a) an edge in FDG can carry a “label” property, and
b) the entities in the <properties> section are only of two
kinds: functions and parts of speech (pos).
For the sake of completeness, figure 6 contains the origi-
nal Penn treebank sentence encoded according to the same
scheme.
3.1.1. A note on the definition of ‘word’
The method used here, where words in the orthographic
sentence carry a one-to-many relationship with node ele-
ments, is required for cases where a theory splits up the
orthographic word into several elements. The inverse rela-
tionship, a many-to-one relationship from words to nodes
is required when a theory collapses several words into one
node. Both scenarios are somewhat rare in English (but fre-
quent elsewhere) as most major theoretical approaches have
a rough agreement on the “status” of the word unit. How-
ever, a simple example of the first type in English is found
in negation contractions, where a ’t element is considered
a separate word in many theories, but is contracted with an
auxiliary in the orthographic representation. In this case,
the orthographic word may be pointed to by two different
nodes in a tree.
The second type of relationship could be the result of an
analysis of phrasal verbs (as the go about example used
earlier)—where a phrasal verb consisting of two separate
orthographic words is associated with only one node. Ob-
viously, both scenarios are encodable without allowing for
edges from words to nodes, but simply by declaring that a
node may consist of several edges that connect to words,
and that many nodes can connect to just one word.
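The point that only node-to-word edges need to be stored can be sketched with plain dictionaries. All identifiers below are invented for illustration; the contraction and phrasal-verb cases mirror the two scenarios described above, and the word-to-node direction is derived rather than encoded.

```python
from collections import defaultdict

# Stored direction: each node lists the orthographic words it points to.
node_to_words = {
    "n_aux": ["w_cant"],               # auxiliary part of a contraction
    "n_neg": ["w_cant"],               # negation part, same orthographic word
    "n_phrasal": ["w_go", "w_about"],  # phrasal verb treated as one node
}

# Derived direction: invert the mapping instead of storing word-to-node edges.
word_to_nodes = defaultdict(list)
for node_id, word_ids in node_to_words.items():
    for word_id in word_ids:
        word_to_nodes[word_id].append(node_id)

print(dict(word_to_nodes))
```

Here one word is reached from two nodes and one node reaches two words, without any edge that starts at a word.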
Whether or not this approach provides a sufficiently
fine-grained annotation obviously depends on the theory and
the language: for a highly agglutinating language where a
long word may be pointed to by a large number of nodes, one
may want to add a morpheme layer between the level of node
and word, where a node would point to a morpheme, and a
morpheme would be part of a word. For our purposes (dealing
primarily with English) we have refrained from adding this
layer for the sake of simplicity.
[4] Note that the trace and corresponding movement of the
V-node are captured by the index number on that node, which
will also be present on the lower Vt node. The information
about the presence of a trace and its movement is thus
recoverable from the indices, and need not be listed as a
separate property.
[5] Sentences from the Penn treebank were analyzed for FDG
with Connexor’s Machinese “tagger”.
[Tree diagram of the GB analysis. Node labels in reading
order: IP; NP_i [+nom]; det (the); N’; N (company); I’;
I (can_t; [+pres], [+1prs], [sg]); VP; V’; V_t; VP; V (go);
PP; P’; P (about); NP [+acc]; NP_i (its); N’; N (business).]
<theories>
<theory_gb>
<node id="n1_5" ref="w1_2">
<properties>
<property="label" value="I"/>
<property="tense" value="pres"/>
<property="person" value="1"/>
<property="number" value="sg"/>
<property="index" value="1"/>
</properties>
</node>
...
</theory_gb>
...
</theories>
Figure 3: The example sentence in GB theory with encoding
details shown for the node I.
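To show how the node properties of figure 3 might be consumed, the sketch below reads the GB node for I into a dictionary. One assumption is made loudly here: the figure writes `<property="label" value="I"/>`, which a standard XML parser will not accept, so the sketch supposes an explicit attribute called `name`. That attribute name is our guess, not something the paper specifies.

```python
import xml.etree.ElementTree as ET

# The GB node from figure 3, with a hypothetical "name" attribute added so
# that the fragment is well-formed XML.
GB_NODE_XML = """
<node id="n1_5" ref="w1_2">
  <properties>
    <property name="label" value="I"/>
    <property name="tense" value="pres"/>
    <property name="person" value="1"/>
    <property name="number" value="sg"/>
    <property name="index" value="1"/>
  </properties>
</node>
"""

node = ET.fromstring(GB_NODE_XML.strip())

# Collect the theory-specific properties as a name -> value dictionary.
props = {p.get("name"): p.get("value") for p in node.iter("property")}

print(node.get("id"), "->", node.get("ref"), props)
```

Nodes that share an `index` value could then be grouped to recover co-indexation and traces, as footnote 4 describes.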
4. Potential applications
As discussed in Section 2, a multi-theory approach to
treebanks alleviates some of the limitations associated with
‘traditional’ treebanks by providing a more general-purpose
data source. However, we also anticipate a number of fruit-
ful applications of multi-theory treebank annotations that
either enhance existing enterprises or break new ground
where investigation was not possible.
[Dependency diagram: arcs labeled comp, det, subj, v-ch, ha
and attr drawn over the sentence, with function tags and
morphological tags below the words.]
The company can go about its business
@DN @SUBJ @+FAUXV @-FMAINV @ADVL @A> @-PCOMPL-S
N DET NH N NOM AUX V AUXMOD VA V INF EH ADV N PRON GEN SG3 NH N NOM SG
<theories>
<theory_fdg>
<node id="n1_0" ref="w1_0">
<edge id="e1_1" ref="n1_1" label="det"/>
<properties>
<property="function" value="@DN"/>
<property="pos" value="N DET"/>
</properties>
</node>
...
</theory_fdg>
...
</theories>
Figure 4: The example sentence in FDG with encoding
details shown for the node associated with the word The.
(S (NP-SBJ (DT The) (NN company))
   (VP (MD can)
       (VP (VB go)
           (PP-CLR (IN about)
                   (NP (PRP$ its) (NN business))))))
Figure 5: The example sentence from the original PTB.
<theories>
<theory_ptb2>
<node id="n1_0" ref="w1_0">
<edge id="e1_1" ref="n1_1"/>
<properties>
<property="label" value="S"/>
</properties>
</node>
...
</theory_ptb2>
...
</theories>
Figure 6: The example sentence in PTB II with encoding
details shown for the node S.
In corpus research, it may be the case that no one
particular theory is fine-grained enough in some limited
area to provide a method for searching for some phenomenon
of interest. The issue then lies in the perceived
limitations of the theory (not the treebank) in question.
In this case, allowing for combined queries over several
different syntactic annotations may be profitable in that
the combined properties may in fact produce nuances and
search possibilities that otherwise would not be easily
formulated.
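A combined query of this kind could be as simple as intersecting conditions from two annotation layers over the shared word ids. The sketch below is only an illustration: the part-of-speech tags come from figure 5 and the function tags from figure 4, but pairing the function tags with words in left-to-right order is our assumption, and `combined_query` is a hypothetical helper.

```python
# PTB part-of-speech tag per word id (from figure 5).
ptb_pos = {
    "w1_0": "DT", "w1_1": "NN", "w1_2": "MD", "w1_3": "VB",
    "w1_4": "IN", "w1_5": "PRP$", "w1_6": "NN",
}

# FDG function tag per word id (from figure 4; the word-by-word alignment
# is assumed from the left-to-right order of the tags).
fdg_function = {
    "w1_0": "@DN", "w1_1": "@SUBJ", "w1_2": "@+FAUXV", "w1_3": "@-FMAINV",
    "w1_4": "@ADVL", "w1_5": "@A>", "w1_6": "@-PCOMPL-S",
}


def combined_query(ptb_tag, fdg_tag):
    """Return word ids matching a PTB tag and an FDG function at once."""
    return [
        word_id
        for word_id, tag in ptb_pos.items()
        if tag == ptb_tag and fdg_function.get(word_id) == fdg_tag
    ]


print(combined_query("NN", "@SUBJ"))
```

Neither layer alone distinguishes the two NN tokens by grammatical function and word class at once; the intersection does.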
Also, for grammar induction purposes, we foresee that
offering a selection of annotations to choose from and test
over may be of interest. Recent work on grammar induction
on the PTB has taken the interesting approach of discarding
the original functional tags (such as NP-SBJ and PP-CLR in
figure 5) because of their perceived “negative utility” in
the learning task. Instead of the original tags, the bare
syntactic tags have been enriched through what is called
“parent annotation,” first introduced in Johnson (1998).
Somewhat surprisingly, this kind of automatic “functional
tagging” to replace the original PTB tags has turned out to
perform very well, as shown in Klein and Manning (2003),
who attain nearly state-of-the-art parser performance
without lexicalization through a few systematic
modifications to the original annotations. Along these
lines, we also see automatic enrichments of existing
treebanks as a possible target for parallel annotation,
where one could annotate a collection of sentences with
minor variants of the same notation simultaneously.
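For readers unfamiliar with parent annotation, the sketch below applies a simplified version of it to the example sentence: functional tags are stripped and each phrasal label is suffixed with the label of its parent. The nested-tuple tree format, the `^` separator and the decision to leave part-of-speech nodes untouched are simplifications of ours and are not taken from the cited works.

```python
def parent_annotate(tree, parent=None):
    """Strip functional tags and append the parent label to phrasal nodes.

    A tree is a tuple (label, child, ...); a leaf child is a plain string.
    """
    label, *children = tree
    # Part-of-speech nodes (a tag over a single word) are left unchanged.
    if len(children) == 1 and isinstance(children[0], str):
        return tree
    base = label.split("-")[0]                  # NP-SBJ -> NP, PP-CLR -> PP
    new_label = f"{base}^{parent}" if parent else base
    return (new_label, *[parent_annotate(child, base) for child in children])


ptb_tree = (
    "S",
    ("NP-SBJ", ("DT", "The"), ("NN", "company")),
    ("VP", ("MD", "can"),
     ("VP", ("VB", "go"),
      ("PP-CLR", ("IN", "about"),
       ("NP", ("PRP$", "its"), ("NN", "business"))))),
)

annotated = parent_annotate(ptb_tree)
print(annotated)
```

The original and the parent-annotated trees are exactly the kind of minor notational variants that could be stored side by side over the same words.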
Perhaps the most notable novel application that a
multi-theory treebank provides is found within theoretical
linguistics. One can construct tools for theoretical
research which would allow a user to query a treebank about
how some element in theory X is represented in theory Y.
Using such a tool, researchers would be in a unique
position to evaluate particular theoretical implications of
a given property and evaluate the coverage of a grammar.
Indirectly, a collection of this sort encourages
researchers to formalize proposed theories more rigorously
and at the same time allows for more objective evaluation
of tentative new theoretical ideas.
5. References
A. Bies, M. Ferguson, K. Katz, and R. MacIntyre. 1995.
Bracketing Guidelines for Treebank II Style Penn Tree-
bank Project.
L. Haegeman. 1994. Introduction to Government and
Binding Theory. Blackwell Publishing.
J. Hockenmaier and M. Steedman. 2007. CCGbank: A
Corpus of CCG Derivations and Dependency Structures
Extracted from the Penn Treebank. Computational Lin-
guistics, 33(3).
M. Honnibal and J. R. Curran. 2005. Creating a Systemic
Functional Grammar Corpus from the Penn Treebank.
ACL 2007, 100:89–96.
M. Johnson. 1998. PCFG models of linguistic tree repre-
sentations. Computational Linguistics, 24(4):613–632.
D. Klein and C. D. Manning. 2003. Accurate unlexicalized
parsing. Proceedings of the 41st Annual Meeting of the
Association for Computational Linguistics, pages 423–
430.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz.
1994. Building a Large Annotated Corpus of En-
glish: The Penn Treebank. Computational Linguistics,
19(2):313–330.
A. Mengel and W. Lezius. 2000. An XML-based representation
format for syntactically annotated corpora. Proceedings of
the Second International Conference on Language Resources
and Evaluation (LREC), 1:121–126.
S. Oepen, D. Flickinger, K. Toutanova, and C. D. Man-
ning. 2004. LinGO Redwoods. Research on Language
& Computation, 2(4):575–596.
Y. Samuelsson and M. Volk. 2005. Presentation and rep-
resentation of parallel treebanks. Proc. of the Treebank-
Workshop at Nodalida.
P. Tapanainen and T. Järvinen. 1997. A non-projective
dependency parser. Proceedings of the 5th Conference on
Applied Natural Language Processing, pages 64–71.