Some Aspects of Parsing Expression Grammar

by Roman R Redziejowski
Fundamenta Informaticae (2008)

Abstract

Parsing Expression Grammar (PEG) is a new way to specify syntax, by means of a top-down process with limited backtracking. It can be directly transcribed into a recursive-descent parser. The parser does not require a separate lexer, and backtracking removes the usual LL(1) constraint. This is convenient for many applications, but there are two problems: PEG is not well understood as a language specification tool, and backtracking may result in exponential processing time. The paper consists of two parts that address these problems. The first part is an attempt to find out the language actually defined by a given parsing expression. The second part reports measurements of backtracking activity in a PEG-derived parser for the programming language C.

Available from iospress.metapress.com
Some Aspects of Parsing Expression Grammar
Roman R. Redziejowski
February 12, 2008 ∗†
1 Introduction
Parsing Expression Grammar (PEG) is a new way to specify syntax, recently introduced by Ford [6–8].
The grammar is a formal description of a recursive-descent parser with limited backtracking.
Recursive-descent parsers have been around for a while. Already in 1961, Lucas [13] suggested the
use of recursive procedures that reflect the syntax of the language being parsed. His design did not
allow backtracking; an explicit assumption about the syntax (section 1.13211) was identical to what
later became known as LL(1). The great advantage of recursive-descent parsers is transparency: the
code closely reflects the grammar, which makes it easy to maintain and modify. However, manipulating
the grammar to force it into the LL(1) mold can make the grammar itself unreadable. The use of
backtracking removes the LL(1) restriction. Complete backtracking, meaning an exhaustive search of all
alternatives, may require exponential time. A reasonable compromise is limited backtracking, also
called "fast-back" in [12]. In that approach, we discard further alternatives once a sub-goal has been
recognized.
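The difference between complete and limited backtracking can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the cited designs: the grammar S = ("a" / "ab") "c", the function names, and the modelling of complete backtracking as a generator of all possible end positions.

```python
from typing import Callable, Iterator, Optional

# Complete backtracking: a parser yields EVERY position at which it can stop.
FullParser = Callable[[str, int], Iterator[int]]
# Limited backtracking: a parser returns one stop position, or None on failure.
FastParser = Callable[[str, int], Optional[int]]


def full_lit(s: str) -> FullParser:
    def parse(text: str, pos: int) -> Iterator[int]:
        if text.startswith(s, pos):
            yield pos + len(s)
    return parse


def full_choice(*alts: FullParser) -> FullParser:
    def parse(text: str, pos: int) -> Iterator[int]:
        for alt in alts:                 # every alternative stays available
            yield from alt(text, pos)
    return parse


def full_seq(first: FullParser, second: FullParser) -> FullParser:
    def parse(text: str, pos: int) -> Iterator[int]:
        for mid in first(text, pos):     # retry with each result of `first`
            yield from second(text, mid)
    return parse


def fast_lit(s: str) -> FastParser:
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def fast_choice(*alts: FastParser) -> FastParser:
    def parse(text: str, pos: int) -> Optional[int]:
        for alt in alts:
            end = alt(text, pos)
            if end is not None:
                return end               # sub-goal recognized: drop the rest
        return None
    return parse


def fast_seq(first: FastParser, second: FastParser) -> FastParser:
    def parse(text: str, pos: int) -> Optional[int]:
        mid = first(text, pos)
        return None if mid is None else second(text, mid)
    return parse


# Hypothetical grammar: S = ("a" / "ab") "c"
full_S = full_seq(full_choice(full_lit("a"), full_lit("ab")), full_lit("c"))
fast_S = fast_seq(fast_choice(fast_lit("a"), fast_lit("ab")), fast_lit("c"))


def full_matches(text: str) -> bool:
    """True if the exhaustive search can consume the whole text."""
    return any(end == len(text) for end in full_S(text, 0))


def fast_matches(text: str) -> bool:
    """True if the limited-backtracking parser consumes the whole text."""
    return fast_S(text, 0) == len(text)
```

On the input "abc" the exhaustive version succeeds by returning to the second alternative "ab", while the limited version has already committed to "a" and fails. Both accept "ac".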
Limited backtracking was adopted in at least two of the early top-down designs: the Atlas Compiler
Compiler of Brooker and Morris [5, 20], and TMG (the TransMoGrifier) of McClure [15]. The syntax
specification used in TMG was later formalized and analyzed by Birman and Ullman [3, 4]. It appears
in [2] as "Top-Down Parsing Language" (TDPL) and "Generalized TDPL" (GTDPL). Parsing Expression
Grammar is a development of the latter.
Parsing Expression Grammar is designed for a unified syntax definition that does not require a separate
"lexer" or "scanner". Together with the lifting of the LL(1) restriction, this gives a very convenient tool
when we need an ad-hoc parser for some application.
One problem with PEG is precisely its new approach to specifying syntax. Although PEG looks very much
like the Extended Backus-Naur Form (EBNF), it is not EBNF. It is an algorithm, and defines whatever
that algorithm happens to accept. Writing PEG for a language specified in EBNF offers many surprises
and is basically a trial-and-error process. It has to be better understood. After a brief introduction to
PEG in Section 2, we try, in Section 3, to find out what is actually defined by a given parsing expression.
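One such surprise can be sketched in a few lines. The example below is an illustration chosen here, not taken from the paper: the rule A = "a"* "a" and all helper names are assumptions. Read as EBNF (approximated with Python's `re` module, which searches all ways to split the input), the rule denotes one or more letters "a". Read as PEG, the greedy repetition consumes every "a" and never gives one back, so the final "a" has nothing left to match.

```python
import re
from typing import Callable, Optional

# A parser maps (text, position) to the new position, or None on failure.
Parser = Callable[[str, int], Optional[int]]


def lit(s: str) -> Parser:
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def star(e: Parser) -> Parser:
    def parse(text: str, pos: int) -> Optional[int]:
        while True:
            end = e(text, pos)
            if end is None or end == pos:   # stop on failure or no progress
                return pos                  # e* always succeeds
            pos = end
    return parse


def seq(*es: Parser) -> Parser:
    def parse(text: str, pos: int) -> Optional[int]:
        for e in es:
            nxt = e(text, pos)
            if nxt is None:
                return None
            pos = nxt
        return pos
    return parse


# Hypothetical rule: A = "a"* "a"
A = seq(star(lit("a")), lit("a"))

peg_result = A("aaa", 0)                   # PEG reading: the star eats all
ebnf_result = re.fullmatch(r"a*a", "aaa")  # language reading: match exists
```

Under these assumptions `peg_result` is `None` while `ebnf_result` is a match object: the same text of a rule defines two different things.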
∗Appeared in Fundamenta Informaticae 85, 1-4 (2008) 441–454.
†Preliminary version appeared in Proceedings of CS&P’2007 (Sept 2007) 594–605.
Another problem with PEG is the backtracking itself. Even limited backtracking may require a lot
of time. In [6,7], PEG was introduced together with a technique called packrat parsing. Packrat parsing
handles backtracking by extensive memoization: storing all results of parsing procedures. It guarantees
linear parsing time at a large memory cost.¹
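The memoization idea can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation from [6,7]: the grammar, the class name `PackratSketch`, and the `calls` counter are all hypothetical. Each result is stored under the key (rule name, input position), so a rule is executed at most once per position.

```python
from typing import Dict, Optional, Tuple


class PackratSketch:
    """Toy parser for the hypothetical grammar
         Sum = Num "+" Sum / Num "-" Sum / Num
         Num = [0-9]+
    """

    def __init__(self, text: str, memoize: bool = True) -> None:
        self.text = text
        self.memoize = memoize
        # (rule name, position) -> end position, or None for failure
        self.memo: Dict[Tuple[str, int], Optional[int]] = {}
        self.calls = 0  # how many times a rule body was actually executed

    def apply(self, rule: str, pos: int) -> Optional[int]:
        key = (rule, pos)
        if self.memoize and key in self.memo:
            return self.memo[key]          # reuse the stored result
        self.calls += 1
        result = getattr(self, rule)(pos)
        if self.memoize:
            self.memo[key] = result        # store success and failure alike
        return result

    def lit(self, s: str, pos: int) -> Optional[int]:
        return pos + len(s) if self.text.startswith(s, pos) else None

    def Sum(self, pos: int) -> Optional[int]:
        for op in ("+", "-"):
            mid = self.apply("Num", pos)   # re-parsed for every alternative
            if mid is not None:
                mid = self.lit(op, mid)
            if mid is not None:
                end = self.apply("Sum", mid)
                if end is not None:
                    return end
        return self.apply("Num", pos)

    def Num(self, pos: int) -> Optional[int]:
        end = pos
        while end < len(self.text) and self.text[end] in "0123456789":
            end += 1
        return end if end > pos else None


plain = PackratSketch("1-2-3", memoize=False)
packrat = PackratSketch("1-2-3", memoize=True)
plain_end = plain.apply("Sum", 0)
packrat_end = packrat.apply("Sum", 0)
```

Both parsers return the same end position, but `packrat.calls` is smaller than `plain.calls` because the repeated attempts at `Num` are answered from the table. The table `packrat.memo` holds one entry per executed rule call, which is the memory cost mentioned above.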
Excessive backtracking does not matter in small interactive applications such as [18], where the input is
short and performance is not critical. But the author had a feeling that the usual programming languages
do not require much backtracking. The feeling was based on the observation that these languages have
large LL(1) parts, and that limited backtracking prevents the parser from going back farther than to the
beginning of a statement. An experiment reported in [19] indeed demonstrated a moderate backtracking
activity in a PEG parser for Java 1.5. Section 4 reports similar results for the programming language C.
2 Parsing Expression Grammar
Parsing Expression Grammar is a set of named parsing expressions. They are specified by rules of the
form A = e where e is a parsing expression and A is the name given to it. Parsing expressions are
instructions for parsing strings. When applied to a character string, a parsing expression tries to match
an initial portion of that string, and may "consume" the matched portion. It may then indicate "success"
or "failure".
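One way to picture this model, offered only as a sketch: a parsing expression is a function from a text and a position to either the position after the consumed portion or a failure value, and a grammar is a table of such functions indexed by name. The names `Expr`, `rules`, `string`, `ref` and `apply_rule`, and the two example rules, are assumptions made for this illustration.

```python
from typing import Callable, Dict, Optional, Tuple

# A parsing expression: given the text and a position, return the position
# after the consumed portion (success) or None (failure).
Expr = Callable[[str, int], Optional[int]]

# The grammar: a set of named parsing expressions, one entry per rule A = e.
rules: Dict[str, Expr] = {}


def string(s: str) -> Expr:
    """Expression that matches the literal string s."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def ref(name: str) -> Expr:
    """Expression specified by its name; looked up when it is applied."""
    def parse(text: str, pos: int) -> Optional[int]:
        return rules[name](text, pos)
    return parse


def apply_rule(name: str, text: str) -> Tuple[str, str]:
    """Apply a named expression to text; report the outcome and what it consumed."""
    end = rules[name](text, 0)
    if end is None:
        return ("failure", "")
    return ("success", text[:end])


# Hypothetical rules:  Hello = "hello"   Greeting = Hello
rules["Hello"] = string("hello")
rules["Greeting"] = ref("Hello")
```

With these rules, `apply_rule("Greeting", "hello world")` gives `("success", "hello")`: the expression matches only an initial portion of the string and says nothing about the rest. `apply_rule("Greeting", "help")` gives `("failure", "")`.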
Figure 1 lists all forms of parsing expressions. Each of e, e1, ..., en in the Figure is a parsing expression,
specified either explicitly or by its name. Subexpressions may be enclosed in parentheses to indicate the
order of applying the operators. In the absence of parentheses, the operators appearing lower in the table
have precedence over those appearing higher. Note the backtracking involved in three constructions: the
sequence and the two predicates.
e1 / ... / en   Ordered choice: Apply expressions e1, ..., en, in this order, to the text
                ahead, until one of them succeeds and possibly consumes some text.
                Indicate success if one of the expressions succeeded. Otherwise do
                not consume any text and indicate failure.
e1 ... en       Sequence: Apply expressions e1, ..., en, in this order, to consume
                consecutive portions of the text ahead, as long as they succeed.
                Indicate success if all succeeded. Otherwise do not consume any
                text and indicate failure.
&e              And predicate: Indicate success if expression e matches the text
                ahead; otherwise indicate failure. Do not consume any text.
!e              Not predicate: Indicate failure if expression e matches the text
                ahead; otherwise indicate success. Do not consume any text.
e+              One or more: Apply expression e repeatedly to match the text ahead,
                as long as it succeeds. Consume the matched text (if any) and
                indicate success if there was at least one match. Otherwise
                indicate failure.
e*              Zero or more: Apply expression e repeatedly to match the text ahead,
                as long as it succeeds. Consume the matched text (if any). Always
                indicate success.
e?              Zero or one: If expression e matches the text ahead, consume it.
                Always indicate success.
[ s ]           Character class: If the character ahead appears in the string s,
                consume it and indicate success. Otherwise indicate failure.
[ c1-c2 ]       Character range: If the character ahead is one from the range c1
                through c2, consume it and indicate success. Otherwise indicate
                failure.
"s"             String: If the text ahead is the string s, consume it and indicate
                success. Otherwise indicate failure.
                Any character: If there is a character ahead, consume it and
                indicate success. Otherwise (that is, at the end of input)
                indicate failure.
Fig. 1.
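The forms in Figure 1 can be transcribed into small functions, which may help in checking one's reading of the descriptions. The following is a sketch under stated assumptions rather than a reference implementation: all function names are invented here, failure is represented by `None`, and the repetition stops when e succeeds without consuming anything, a guard against endless looping that Figure 1 does not mention.

```python
from typing import Callable, Optional

# (text, position) -> position after the consumed text, or None on failure.
Expr = Callable[[str, int], Optional[int]]


def choice(*es: Expr) -> Expr:                      # e1 / ... / en
    def parse(text: str, pos: int) -> Optional[int]:
        for e in es:
            end = e(text, pos)
            if end is not None:
                return end                          # first success wins
        return None
    return parse


def seq(*es: Expr) -> Expr:                         # e1 ... en
    def parse(text: str, pos: int) -> Optional[int]:
        cur = pos
        for e in es:
            nxt = e(text, cur)
            if nxt is None:
                return None                         # back to pos: nothing consumed
            cur = nxt
        return cur
    return parse


def and_(e: Expr) -> Expr:                          # &e
    return lambda text, pos: pos if e(text, pos) is not None else None


def not_(e: Expr) -> Expr:                          # !e
    return lambda text, pos: pos if e(text, pos) is None else None


def star(e: Expr) -> Expr:                          # e*
    def parse(text: str, pos: int) -> Optional[int]:
        while True:
            end = e(text, pos)
            if end is None or end == pos:           # assumed no-progress guard
                return pos
            pos = end
    return parse


def plus(e: Expr) -> Expr:                          # e+
    return seq(e, star(e))


def opt(e: Expr) -> Expr:                           # e?
    def parse(text: str, pos: int) -> Optional[int]:
        end = e(text, pos)
        return pos if end is None else end
    return parse


def char_class(s: str) -> Expr:                     # [ s ]
    return lambda text, pos: pos + 1 if pos < len(text) and text[pos] in s else None


def char_range(c1: str, c2: str) -> Expr:           # [ c1-c2 ]
    return lambda text, pos: pos + 1 if pos < len(text) and c1 <= text[pos] <= c2 else None


def string(s: str) -> Expr:                         # "s"
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None


def any_char() -> Expr:                             # any character
    return lambda text, pos: pos + 1 if pos < len(text) else None


# Hypothetical usage: an identifier, and the keyword "if" not followed by
# a letter or digit.
letter = choice(char_range("a", "z"), char_range("A", "Z"))
digit = char_range("0", "9")
identifier = seq(letter, star(choice(letter, digit)))
keyword_if = seq(string("if"), not_(choice(letter, digit)))
```

The backtracking in the sequence and in the two predicates shows up as a return to the saved position `pos`: each function keeps the starting position and hands it back, or reports failure from it, whatever its subexpressions consumed.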
¹ "Packrat" comes from pack rat – a small rodent (Neotoma cinerea) known for hoarding unnecessary items; also a person
that does the same. "Memoization", introduced in [16], is the technique of reusing stored results of function calls instead
of recomputing them.