| Parsing Expression Grammar |
Article Index for Parsing |
Website Links For Expression |
Information AboutParsing Expression Grammar |
| CATEGORIES ABOUT PARSING EXPRESSION GRAMMAR | |
| formal languages | |
| parsing algorithms | |
| SHOPPER'S DELIGHT | |
|
Unlike CFGs, PEGs are unambiguous; if a string parses, it has exactly one valid Parse Tree . This suits PEGs well to parsing computer languages, but not Natural Language s. DEFINITION Formally, a parsing expression grammar consists of:
Each parsing rule in ''P'' has the form ''A'' ← ''e'', where ''A'' is a nonterminal and ''e'' is a ''parsing expression''. A parsing expression is a hierarchical Expression similar to a Regular Expression , which is constructed in the following fashion: # An ''atomic parsing expression'' consists of:
# Given any existing parsing expressions ''e'', ''e''1, and ''e''2, a new parsing expression can be constructed using the following operators:
Unlike in a Context-free Grammar or other Generative Grammar s, in a parsing expression grammar there must be ''exactly one rule'' in the grammar having a given nonterminal on its left-hand side. That is, rules act as ''definitions'' in a PEG, and each nonterminal must have one and only one definition. Interpretation of parsing expressions Each nonterminal in a parsing expression grammar essentially represents a parsing Function in a Recursive Descent Parser , and the corresponding parsing expression represents the "code" comprising the function. Each parsing function conceptually takes an input string as its argument, and yields one of the following results:
A nonterminal may succeed without actually consuming any input, and this is considered an outcome distinct from failure. An atomic parsing expression consisting of a single terminal succeeds if the first character of the input string matches that terminal, and in that case consumes the input character; otherwise the expression yields a failure result. An atomic parsing expression consisting of the empty string always trivially succeeds without consuming any input. An atomic parsing expression consisting of a nonterminal ''A'' represents a Recursive call to the nonterminal-function ''A''. The sequence operator ''e''1 ''e''2 first invokes ''e''1, and if ''e''1 succeeds, subsequently invokes ''e''2 on the remainder of the input string left unconsumed by ''e''1, and returns the result. If either ''e''1 or ''e''2 fails, then the sequence expression ''e''1 ''e''2 fails. The choice operator ''e''1 / ''e''2 first invokes ''e''1, and if ''e''1 succeeds, returns its result immediately. Otherwise, if ''e''1 fails, then the choice operator Backtracks to the original input position at which it invoked ''e''1, but then calls ''e''2 instead, returning ''e''2's result.
Finally, the and-predicate and not-predicate operators implement Syntactic Predicate s. The expression &''e'' invokes the sub-expression ''e'', and then succeeds if ''e'' succeeds and fails if ''e'' fails, but in either case ''never consumes any input''. Conversely, the expression !''e'' succeeds if ''e'' fails and fails if ''e'' succeeds, again consuming no input in either case. Because these can use an arbitrarily complex sub-expression ''e'' to "look ahead" into the input string without actually consuming it, they provide a powerful syntactic Lookahead and disambiguation facility. Examples This is a PEG for mathematical formulas that use the basic four operations: Value
Expr S A B The following recursive rule matches standard C-style if/then/else statements in such a way that the optional "else" clause always binds to the innermost "if", because of the implicit prioritization of the '/' operator. (In a Context-free Grammar , this construct yields the classic Dangling Else Ambiguity .) S The parsing expression foo &(bar) matches and consumes the text "foo" but only if it is followed by the text "bar". The parsing expression foo !(bar) matches the text "foo" but only if it is ''not'' followed by the text "bar". The expression !(a+ b) a matches a single "a" but only if it is not the first in an arbitrarily-long sequence of a's followed by a b. Z IMPLEMENTING PARSERS FROM PARSING EXPRESSION GRAMMARS Any parsing expression grammar can of course be converted directly into a Recursive Descent Parser . Due to the unlimited Lookahead capability that the grammar formalism provides, however, the resulting parser could exhibit Exponential Time performance in the worst case. By Memoizing the results of intermediate parsing steps and ensuring that each parsing function is only invoked at most once at a given input position, however, it is possible to convert any parsing expression grammar into a Packrat Parser , which always runs in Linear Time at the cost of substantially greater storage space requirements. It is also possible to build LL Parser s and LR Parser s from parsing expression grammars, but the unlimited lookahead capability of the grammar formalism is lost in this case. ADVANTAGES PEGs make a good replacement for Regular Expressions , because they are strictly more powerful. For example, a regular expression inherently cannot find matched pairs of parentheses, because it is not recursive, but a PEG can. Any PEG can be parsed in linear time by using a Packrat Parser , as described above. Parsers for languages expressed as a CFG, such as LR Parser s, require a separate Tokenization step to be done first, which breaks up the input based on the location of spaces, punctuation, etc. The tokenization is necessary because of the way these parsers use ''lookahead'' to parse CFGs that meet certain requirements in linear time. PEGs do not require tokenization to be a separate step, and tokenization rules can be written in the same way as any other grammar rule. Many CFGs contain inherent ambiguities, even when they're intended to describe unambiguous languages. The " Dangling Else " problem in C, C++, and Java is one example. These problems are often resolved by applying a rule outside of the grammar. In a PEG, these ambiguities never arise, because of prioritization. DISADVANTAGES PEGs are very new, and are not widely implemented. Regular expressions and CFGs have the advantage that they have been around for decades, the code to parse them has been extremely optimized, and many programmers are already familiar with how to use them. PEGs cannot express '' Left-recursive '' rules, where a rule refers to itself without moving forward in the string. For example, in the arithmetic grammar above, it would be tempting to move some rules around so that the precedence order of products and sums could be expressed in one line: Value Expr The problem with this is that it says that to match an Expr, you need to first see if a Product matches there, and to match a Product, you need to first see if an Expr matches there. This is not possible. However, left-recursive rules can always be rewritten to not be left-recursive. In the simplest case, for example, a left-recursive rule repeats a certain expression indefinitely, as in the CFG rule |