Cooper K. Engineering a Compiler

3.2. EXPRESSING SYNTAX

The classic example of an ambiguous construct in the grammar for a programming language arises in the definition of the if-then-else construct found in many Algol-like languages. The straightforward grammar for if-then-else might be:

    Stmt → if ( Expr ) then Stmt else Stmt
         | if ( Expr ) then Stmt
         | Assignment
         | . . .

This fragment shows that the else part is optional. Unfortunately, with this grammar the code fragment

    if (Expr1) then if (Expr2) then Stmt1 else Stmt2

has two distinct derivations. The difference between them is simple. Using indentation to convey the relationship between the various parts of the statements, we have:

    if (Expr1)                        if (Expr1)
      then if (Expr2)                   then if (Expr2)
             then Stmt1                        then Stmt1
             else Stmt2                 else Stmt2

The version on the left has Stmt2 controlled by the inner if statement, so it executes if Expr1 is true and Expr2 is false. The version on the right associates the else clause with the first if statement, so that Stmt2 executes if Expr1 is false (independently of Expr2). Clearly, the difference in derivation will produce different behavior for the compiled code.

To remove this ambiguity, the grammar must be modified to encode a rule for determining which if controls an else. To fix the if-then-else grammar, we can rewrite it as:

    Stmt → WithElse
         | LastElse

    WithElse → if ( Expr ) then WithElse else WithElse
             | Assignment
             | other statements . . .

    LastElse → if ( Expr ) then Stmt
             | if ( Expr ) then WithElse else LastElse

The solution restricts the set of statements that can occur in the then-part of an if-then-else construct. It accepts the same set of sentences as the original grammar, but ensures that each else has an unambiguous match to a specific if. It encodes into the grammar a simple rule—bind each else to the innermost unclosed if.
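The innermost-if rule that the rewritten grammar encodes can be seen in a small recursive-descent sketch. This code is illustrative, not from the text: it assumes a pre-tokenized input and simplifies Expr and Assignment to single tokens.

```python
def parse_stmt(toks, i):
    """Parse Stmt starting at toks[i]; return (tree, next_index).

    The recursive call for the then-part consumes any following 'else'
    itself, so each else binds to the innermost unclosed if -- exactly
    the rule the WithElse/LastElse grammar encodes.
    """
    if toks[i] == 'if':
        cond = toks[i + 2]                    # if ( Expr ) then Stmt ...
        then_part, i = parse_stmt(toks, i + 5)
        if i < len(toks) and toks[i] == 'else':
            else_part, i = parse_stmt(toks, i + 1)
            return ('if', cond, then_part, else_part), i
        return ('if', cond, then_part, None), i
    return ('assign', toks[i]), i + 1         # Assignment as one token

# The ambiguous sentence from the text: the else attaches to Expr2's if.
toks = ['if', '(', 'Expr1', ')', 'then',
        'if', '(', 'Expr2', ')', 'then', 'Stmt1', 'else', 'Stmt2']
tree, _ = parse_stmt(toks, 0)
print(tree)
```

Running the sketch shows the outer if with no else part and the else nested under the inner if, matching the disambiguated grammar.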

CHAPTER 3. PARSING

This ambiguity arises from a simple shortcoming of the grammar. The solution resolves the ambiguity in a way that is both easy to understand and easy for the programmer to remember. In Section 3.6.1, we will look at other kinds of ambiguity and systematic ways of handling them.

3.2.3 Encoding Meaning into Structure

The if-then-else ambiguity points out the relationship between grammatical structure and meaning. However, ambiguity is not the only situation where grammatical structure and meaning interact. Consider again the derivation tree for our simple expression, Number1 + Number2 × Number3:

    Expr
    ├── Expr
    │   ├── Expr
    │   │   └── Number1
    │   ├── Op
    │   │   └── +
    │   └── Number2
    ├── Op
    │   └── ×
    └── Number3

We have added subscripts to the instances of Number to disambiguate the discussion. A natural way to evaluate the expression is with a simple postorder treewalk. This would add Number1 and Number2 and use that result in the multiplication with Number3, producing (Number1 + Number2) × Number3. This evaluation contradicts the rules of algebraic precedence taught in early algebra classes. Standard precedence would evaluate this expression as

Number1 + (Number2 × Number3).

The expression grammar should have the property that it builds a tree whose “natural” treewalk evaluation produces this result.

The problem lies in the structure of the grammar. All the arithmetic operators derive in the same way, at the same level of the grammar. We need to restructure the grammar to embed the proper notion of precedence into its structure, in much the same way that we embedded the rule to disambiguate the if-then-else problem.

To introduce precedence into the grammar, we need to identify the appropriate levels of precedence in the language. For our simple expression grammar, we have two levels of precedence: lower precedence for + and −, and higher precedence for × and ÷. We associate a distinct non-terminal with each level of precedence and isolate the corresponding part of the grammar.

    1. Expr → Expr + Term
    2.      | Expr − Term
    3.      | Term
    4. Term → Term × Number
    5.      | Term ÷ Number
    6.      | Number


Here, Expr represents the lower level of precedence for + and −, while Term represents the higher level for × and ÷. Using this grammar to produce a rightmost derivation for the expression Number1 + Number2 × Number3, we get:

    Rule   Sentential Form
           Expr
    1      Expr + Term
    4      Expr + Term × Number3
    6      Expr + Number2 × Number3
    3      Term + Number2 × Number3
    6      Number1 + Number2 × Number3

This produces the following syntax tree:

    Expr
    ├── Expr
    │   └── Term
    │       └── Number1
    ├── +
    └── Term
        ├── Term
        │   └── Number2
        ├── ×
        └── Number3

A postorder treewalk over this syntax tree will first evaluate Number2×Number3 and then add Number1 to the result. This corresponds to the accepted rules for precedence in ordinary arithmetic. Notice that the addition of non-terminals to enforce precedence adds interior nodes to the tree. Similarly, substituting the individual operators for occurrences of Op removes interior nodes from the tree.
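The effect of the precedence levels is easy to check with a postorder treewalk over small tuple-encoded trees. This is a sketch, not the book's code; the tuple encoding and function names are illustrative.

```python
import operator

OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def postorder_eval(node):
    """Evaluate a tree bottom-up: leaves are numbers, interior
    nodes are (left, op, right) triples."""
    if isinstance(node, tuple):
        left, op, right = node
        return OPS[op](postorder_eval(left), postorder_eval(right))
    return node

# Tree built by the precedence grammar for 1 + 2 * 3:
# Term * Number nests below Expr + Term, so * is evaluated first.
with_precedence = (1, '+', (2, '*', 3))

# Tree shape from the original single-level grammar, built left to right.
flat = ((1, '+', 2), '*', 3)

print(postorder_eval(with_precedence))   # 7
print(postorder_eval(flat))              # 9
```

The same treewalk gives 7 on the precedence-shaped tree and 9 on the flat one, which is exactly the contrast the text describes.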

To make this example grammar a little more realistic, we might add optional parentheses at the highest level of precedence. This requires introduction of another non-terminal and an appropriate rewrite of the grammar. If we also add support for both numbers and identifiers, the resulting grammar looks like:

    1. Expr → Expr + Term
    2.      | Expr − Term
    3.      | Term
    4. Term → Term × Factor
    5.      | Term ÷ Factor
    6.      | Factor
    7. Factor → ( Expr )
    8.        | Number
    9.        | Identifier

We will use this grammar as we explore parsing in the next several sections. We will refer to it as the classic expression grammar. In discussing automatic techniques, we may add one more production: Goal → Expr. Having a unique goal symbol simplifies some of the algorithms for automatically deriving parsers. For space reasons, we will often abbreviate Number as Num and Identifier as Id.


3.2.4 Discovering a Specific Derivation

We have seen how to discover sentences that are in L(G) for our grammar G. By contrast, a compiler must infer a derivation for a given input string that, in fact, may not be a sentence. The process of constructing a derivation from a specific input sentence is called parsing. The compiler needs an automatic, algorithmic way to discover derivations. Since derivation trees are equivalent to derivations, we can think of parsing as building the derivation tree from the input string. Since parsing works by constructing a derivation tree, we often call that tree a parse tree.

The root of the derivation tree is fixed; it is a single node representing the goal symbol. The leaves of the tree are determined by the input string; the leaves must match the stream of classified words returned by the scanner. The remaining task is to discover an interior structure for the tree that connects the leaves to the root and is consistent with the rules embodied in the grammar. Two distinct and opposite approaches for constructing the tree suggest themselves.

Top-down parsers begin with the root and proceed by growing the tree toward the leaves. At each step, a top-down parser selects some non-terminal node and extends the tree downward from that node.

Bottom-up parsers begin with the leaves and proceed by growing the tree toward the root. At each step, a bottom-up parser adds nodes that extend the partially-built tree upward.

In either scenario, the parser makes a series of choices about which production to apply. Much of the intellectual complexity in parsing lies in the mechanisms for making these choices.

3.2.5 Context-Free Grammars versus Regular Expressions

To understand the differences between regular expressions and context-free grammars, consider the following two examples, a cfg (left) and an re (right) for the same language:

    Expr → Expr Op Expr          ((ident | num) op)* (ident | num)
         | Number
         | Identifier            where op is + | − | × | ÷

    Op   → + | − | × | ÷

where Identifier and Number have their accustomed meanings. Both the re and the cfg describe the same simple set of expressions.

To make the difference between regular expressions and context-free languages clear, consider the notion of a regular grammar. Regular grammars have the same expressive power as regular expressions; that is, they can describe the full set of regular languages.

A regular grammar is defined as a four-tuple, R = (T, NT, S, P), with the same meanings as a context-free grammar. In a regular grammar, however, productions in P are restricted to one of two forms: α → a, or α → aβ, where


α, β ∈ NT and a ∈ T. In contrast, a context-free grammar allows productions with right-hand sides that contain an arbitrary string of symbols from (T ∪ NT). Thus, regular grammars are a proper subset of context-free grammars. The same relationship holds for the regular languages and context-free languages.

(Expressing the difference as a re, the regular grammar is limited to right-hand sides of the form T | T NT, while the context-free grammar allows right-hand sides of the form (T | NT)*.)

Of course, we should ask: “are there interesting programming language constructs that can be expressed in a cfg but not a rg?” Many important features of modern programming languages fall into this gap between cfgs and rgs (or res). Examples include matching brackets, like parentheses, braces, and pairs of keywords (i.e., begin and end). Equally important, as the discussion in Section 3.2.3 shows, it can be important to shape the grammar so that it encodes specific information into the parse tree. For example, regular grammars cannot encode precedence, or the structure of an if-then-else construct. In contrast, all of these issues are easily encoded into a context-free grammar.

Since cfgs can recognize any construct specified by a re, why use res at all? The compiler writer could encode the lexical structure of the language directly into the cfg. The answer lies in the relative efficiency of dfa-based recognizers. Scanners based on dfa implementations take time proportional to the length of the input string to recognize words in the language. With reasonable implementation techniques, even the constants in the asymptotic complexity are small. In short, scanners are quite fast. In contrast, parsers for cfgs take time proportional to the length of the input, plus the length of the derivation. The constant overhead per terminal symbol is higher, as well.

Thus, compiler writers use dfa-based scanners for their efficiency, as well as their convenience. Moving micro-syntax into the context-free grammar would enlarge the grammar, lengthen the derivations, and make the front-end slower. In general, res are used to classify words and to match patterns. When higher-level structure is needed, to match brackets, to impart structure, or to match across complex intervening context, cfgs are the tool of choice.
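The bracket-matching gap can be made concrete: recognizing balanced parentheses requires tracking unbounded nesting depth, which no dfa can do with its fixed set of states. A minimal sketch (illustrative, not from the text):

```python
def balanced(s):
    """Recognize the context-free language of balanced parentheses.

    The counter tracks nesting depth; a DFA cannot do this, since it
    would need a distinct state for every possible depth."""
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:        # a ')' with no matching '('
                return False
    return depth == 0

print(balanced('(()())'))   # True
print(balanced('(()'))      # False
```

A cfg expresses the same language in two productions (P → ( P ) P | ε); the counter here plays the role of the stack a cfg recognizer would use.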

3.3 Top-Down Parsing

A top-down parser begins with the root of the parse tree and systematically extends the tree downward until it matches the leaves of the tree, which represent the classified words returned by the scanner. At each point, the process considers a partially-built parse tree. It selects a non-terminal symbol on the lower fringe of the tree and extends it by adding children that correspond to the right-hand side of some production for that non-terminal. It cannot extend the frontier from a terminal. This process continues until either the entire syntax tree is constructed, or a clear mismatch between the partial syntax tree and its leaves is detected. In the latter case, two possibilities exist. The parser may have selected the wrong production at some earlier step in the process, in which case backtracking will lead it to the correct choice. Alternatively, the input string may not be a valid sentence in the language being parsed; in this case,

    token ← next token()
    root ← start symbol
    node ← root
    loop forever
        if node ∈ T and node matches token then
            advance node to next node on the fringe
            token ← next token()
        else if node ∈ T and node does not match token then
            backtrack
        else if node ∈ NT then
            pick a rule "node → β"
            extend tree from node by building β
            node ← leftmost symbol in β
        if node is empty and token = eof then accept
        else if node is empty and token ≠ eof then backtrack

Figure 3.1: A leftmost, top-down parsing algorithm

backtracking will fail and the parser should report the syntactic error back to the user. Of course, the parser must be able to distinguish, eventually, between these two cases.

Figure 3.1 summarizes this process. The process works entirely on the lower fringe of the parse tree—which corresponds to the sentential forms we used in the examples in Section 3.2.2. We have chosen to expand, at each step, the leftmost non-terminal. This corresponds to a leftmost derivation. This ensures that the parser considers the words in the input sentence in the left-to-right order returned by the scanner.

3.3.1 Example

To understand the top-down parsing algorithm, consider a simple example: recognizing x − 2 × y as a sentence described by the classic expression grammar. The goal symbol of the grammar is Expr; thus the parser begins with a tree rooted in Expr. To show the parser's actions, we will expand our tabular representation of a derivation. The leftmost column shows the grammar rule used to reach each state; the center column shows the lower fringe of the partially constructed parse tree, which is the most recently derived sentential form. On the right, we have added a representation of the input stream. The ↑ shows the position of the scanner; it precedes the current input character. We have added two actions, → and ←, to represent advancing the input pointer and backtracking through the set of productions, respectively. The first several moves of the parser look like:

    Rule or Action   Sentential Form      Input
                     Expr                 ↑ x − 2 × y
    1                Expr + Term          ↑ x − 2 × y
    3                Term + Term          ↑ x − 2 × y
    6                Factor + Term        ↑ x − 2 × y
    9                Identifier + Term    ↑ x − 2 × y
    →                Identifier + Term    x ↑ − 2 × y

The parser begins with Expr, the grammar’s start symbol, and expands it, using rule 1. Since it is deriving a leftmost derivation, at each step, it considers the leftmost, unmatched symbol on the parse tree’s lower fringe. Thus, it tries to rewrite Expr to derive Identifier. To do this, it rewrites the first non-terminal, Expr, into Term using rule 3. Then, it rewrites Term into Factor using rule 6, and Factor into Identifier using rule 9.

At this point, the leftmost symbol is a terminal symbol, so it checks for a match against the input stream. The words match, so it advances the input stream by calling the scanner (denoted by the → in the first column), and it moves rightward by one symbol along the parse tree's lower fringe.

After advancing, the parser again faces a terminal symbol as its leftmost unmatched symbol. It checks for a match against the current input symbol and discovers that the symbol + in the parse tree cannot match − in the input stream. At this point, one of two cases must hold:

1.The parser made a poor selection at some earlier expansion; if this is the case, it can backtrack and consider the alternatives for each choice already made.

2.The input string is not a valid sentence; if this is the case, the parser can only detect it by running out of possibilities in its backtracking.

In the example, the actual misstep occurred in the first expansion, when the parser rewrote Expr using rule 1. To correct that decision, the parser would need to retract the most recent rewrite, by rule 9, and try the other possibilities for expanding Factor. Of course, neither rule 7 nor rule 8 generates a match against the first input symbol, so it then backtracks on the expansion of Term, considering rules 4 and 5 as alternatives to rule 6. Those expansions will eventually fail, as well, since neither × nor ÷ matches −. Finally, the parser will reconsider the rewrite of Expr with rule 1. When it tries rule 2, it can continue, at least for a while. In the derivation table, we denote the entire backtracking sequence with the action "←". This line reads "backtrack to this sentential form and input state."

    Rule or Action   Sentential Form      Input
                     Expr                 ↑ x − 2 × y
    2                Expr − Term          ↑ x − 2 × y
    3                Term − Term          ↑ x − 2 × y
    6                Factor − Term        ↑ x − 2 × y
    9                Identifier − Term    ↑ x − 2 × y
    →                Identifier − Term    x ↑ − 2 × y
    →                Identifier − Term    x − ↑ 2 × y

Working from the expansion by rule 2, the parser has worked its way back to matching Identifier against x and advancing both the input symbol and the position on the fringe. Now, the terminal symbol on the fringe matches the input symbol, so it can advance both the fringe and the input symbol, again. At this point, the parser continues, trying to match Term against the current input symbol 2.

    Rule or Action   Sentential Form        Input
    6                Identifier − Factor    x − ↑ 2 × y
    8                Identifier − Number    x − ↑ 2 × y
    →                Identifier − Number    x − 2 ↑ × y

The natural way to rewrite the fringe toward matching 2 is to rewrite by rule 6 and then rule 8. Now, the parser can match the terminal Number against the input symbol 2. When it goes to advance the input symbol and the unmatched node on the fringe, it discovers that it has no symbols left on the fringe, but it has more input to consume. This triggers another round of backtracking, back to the rewrite of Term; when it rewrites Term with rule 4, it can proceed to a final and correct parse.

    Rule or Action   Sentential Form                    Input
    ←                Identifier − Term                  x − ↑ 2 × y
    4                Identifier − Term × Factor         x − ↑ 2 × y
    6                Identifier − Factor × Factor       x − ↑ 2 × y
    8                Identifier − Number × Factor       x − ↑ 2 × y
    →                Identifier − Number × Factor       x − 2 ↑ × y
    →                Identifier − Number × Factor       x − 2 × ↑ y
    9                Identifier − Number × Identifier   x − 2 × ↑ y
    →                Identifier − Number × Identifier   x − 2 × y ↑

Finally, the parser has reached a configuration where it has systematically eliminated all non-terminals from the lower fringe of the parse tree, matched each leaf of the parse tree against the corresponding symbol in the input stream, and exhausted the input stream. This was the definition of success, so it has constructed a legal derivation for x − 2 × y.


3.3.2 Left Recursion

Clearly, the choice of a rewrite rule at each step has a strong impact on the amount of backtracking that the parser must perform. Consider another possible sequence of expansions for the same input string:

    Rule or Action   Sentential Form             Input
                     Expr                        ↑ x − 2 × y
    1                Expr + Term                 ↑ x − 2 × y
    1                Expr + Term + Term          ↑ x − 2 × y
    1                Expr + Term + Term + Term   ↑ x − 2 × y
    1                · · ·                       ↑ x − 2 × y

Here, the parser follows a simple rule. For each instance of a non-terminal, it cycles through the productions in the order that they appear in the grammar. Unfortunately, rule 1 generates a new instance of Expr to replace the old instance, so it generates an ever-expanding fringe without making any measurable progress.

The problem with this example arises from the combination of left recursion in the grammar and the top-down parsing technique. A production is said to be left-recursive if the first symbol on its right-hand side is the same as its left-hand side, or if the left-hand side symbol appears in the right-hand side and all the symbols that precede it can derive the empty string. Left recursion can produce non-termination in a top-down parser because it allows the algorithm to expand the parse tree's lower fringe indefinitely without generating a terminal symbol. Since backtracking is only triggered by a mismatch between a terminal symbol on the fringe and the current input symbol, the parser cannot recover from the expansion induced by left recursion.
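The runaway expansion is easy to demonstrate concretely. In this illustrative sketch (not the book's code), we repeatedly expand the leftmost Expr by rule 1 and watch the fringe grow while its leftmost symbol remains a non-terminal:

```python
# Repeatedly expand the leftmost Expr with rule 1 (Expr -> Expr + Term).
# The leftmost symbol on the fringe is always the non-terminal Expr, so
# no terminal ever reaches the front to trigger a match or a backtrack.
fringe = ['Expr']
for step in range(3):
    i = fringe.index('Expr')              # leftmost non-terminal
    fringe[i:i + 1] = ['Expr', '+', 'Term']
    print(fringe)
# ['Expr', '+', 'Term']
# ['Expr', '+', 'Term', '+', 'Term']
# ['Expr', '+', 'Term', '+', 'Term', '+', 'Term']
```

Each iteration adds two symbols but never consumes input, so the loop could run forever; that is exactly the non-termination described above.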

We can mechanically transform a grammar to remove left recursion. For an obvious and immediate left recursion, shown on the left, we can rewrite it using right recursion, as shown on the right.

    fee → fee α          fee → β fie
        | β              fie → α fie
                             | ε

The transformation introduces a new non-terminal, fie, and transfers the recursion onto fie. It also adds the rule fie → ε, where ε represents the empty string. This ε-production requires careful interpretation in the parsing algorithm. If the parser expands by the rule fie → ε, the effect is to advance the current node along the tree's fringe by one position.
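The rewrite is mechanical enough to code directly. In this sketch the function name and the lists-of-lists grammar encoding are my own, and `[]` stands for the ε right-hand side:

```python
def remove_left_recursion(nt, prods):
    """Rewrite  nt -> nt a | b  as  nt -> b nt', nt' -> a nt' | eps.

    `prods` is a list of right-hand sides, each a list of symbols."""
    recursive = [rhs[1:] for rhs in prods if rhs and rhs[0] == nt]
    base = [rhs for rhs in prods if not rhs or rhs[0] != nt]
    if not recursive:                     # no immediate left recursion
        return {nt: prods}
    prime = nt + "'"
    return {
        nt: [rhs + [prime] for rhs in base],
        prime: [rhs + [prime] for rhs in recursive] + [[]],   # [] is eps
    }

print(remove_left_recursion('Expr',
      [['Expr', '+', 'Term'], ['Expr', '-', 'Term'], ['Term']]))
```

Applied to the Expr rules, it produces Expr → Term Expr' and Expr' → + Term Expr' | − Term Expr' | ε, matching the transformed grammar shown below.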

For a simple, immediate left recursion, we can directly apply the transformation. In our expression grammar, this situation arises twice: in the rules for Expr and in the rules for Term.

    Original                    Transformed

    Expr → Expr + Term          Expr  → Term Expr'
         | Expr − Term          Expr' → + Term Expr'
         | Term                       | − Term Expr'
                                      | ε
    Term → Term × Factor        Term  → Factor Term'
         | Term ÷ Factor        Term' → × Factor Term'
         | Factor                     | ÷ Factor Term'
                                      | ε

Plugging these replacements into the classic expression grammar yields:

    1.  Expr  → Term Expr'
    2.  Expr' → + Term Expr'
    3.        | − Term Expr'
    4.        | ε
    5.  Term  → Factor Term'
    6.  Term' → × Factor Term'
    7.        | ÷ Factor Term'
    8.        | ε
    9.  Factor → ( Expr )
    10.        | Number
    11.        | Identifier

This grammar describes the same set of expressions as the classic expression grammar. It uses right-recursion. It retains the left-associativity of the original grammar. It should work well with a top-down parser.

    Rule or Action   Sentential Form                  Input
                     Expr                             ↑ x − 2 × y
    1                Term Expr'                       ↑ x − 2 × y
    5                Factor Term' Expr'               ↑ x − 2 × y
    11               Id Term' Expr'                   ↑ x − 2 × y
    →                Id Term' Expr'                   x ↑ − 2 × y
    8                Id Expr'                         x ↑ − 2 × y
    3                Id − Term Expr'                  x ↑ − 2 × y
    →                Id − Term Expr'                  x − ↑ 2 × y
    5                Id − Factor Term' Expr'          x − ↑ 2 × y
    10               Id − Num Term' Expr'             x − ↑ 2 × y
    →                Id − Num Term' Expr'             x − 2 ↑ × y
    6                Id − Num × Factor Term' Expr'    x − 2 ↑ × y
    →                Id − Num × Factor Term' Expr'    x − 2 × ↑ y
    11               Id − Num × Id Term' Expr'        x − 2 × ↑ y
    →                Id − Num × Id Term' Expr'        x − 2 × y ↑
    8                Id − Num × Id Expr'              x − 2 × y ↑
    4                Id − Num × Id                    x − 2 × y ↑
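The trace above can be reproduced with a compact sketch of the leftmost, backtracking algorithm of Figure 3.1. The dict encoding of the grammar and the recursive formulation are my own; note that with the left-recursive classic grammar this same code would recurse forever, for the reason Section 3.3.2 explains, so it uses the right-recursive grammar of rules 1 through 11 (with `[]` as the ε right-hand side).

```python
def parse(grammar, terminals, fringe, toks):
    """Leftmost top-down parse with backtracking.

    `fringe` is the unmatched part of the parse tree's lower fringe.
    Trying each alternative of a non-terminal in order, and returning
    False on failure, realizes the backtracking of Figure 3.1."""
    if not fringe:
        return not toks                    # accept only at end of input
    head, rest = fringe[0], fringe[1:]
    if head in terminals:                  # terminal: must match next token
        return bool(toks) and toks[0] == head and \
               parse(grammar, terminals, rest, toks[1:])
    return any(parse(grammar, terminals, rhs + rest, toks)
               for rhs in grammar[head])   # non-terminal: try each rule

G = {
    'Expr':   [['Term', "Expr'"]],
    "Expr'":  [['+', 'Term', "Expr'"], ['-', 'Term', "Expr'"], []],
    'Term':   [['Factor', "Term'"]],
    "Term'":  [['*', 'Factor', "Term'"], ['/', 'Factor', "Term'"], []],
    'Factor': [['(', 'Expr', ')'], ['Num'], ['Id']],
}
T = {'+', '-', '*', '/', '(', ')', 'Num', 'Id'}

print(parse(G, T, ['Expr'], ['Id', '-', 'Num', '*', 'Id']))   # True
print(parse(G, T, ['Expr'], ['Id', '-', '*']))                # False
```

The first call accepts the token stream for x − 2 × y, mirroring the derivation table; the second rejects an input that is not a sentence, after its backtracking possibilities are exhausted.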
