
4

General non-directional methods

In this chapter we will present two general parsing methods, both non-directional: Unger's method and the CYK method. These methods are called non-directional because they access the input in a seemingly arbitrary order. They require the entire input to be in memory before parsing can start.

Unger’s method is top-down; if the input belongs to the language at hand, it must be derivable from the start symbol of the grammar. Therefore, it must be derivable from a right-hand side of the start symbol, say A1A2...Am. This, in turn, means that A1 must derive a first part of the input, A2 a second part, etc. If the input sentence is z1z2...zn, this demand can be depicted as follows:

A1  ...  Ai  ...  Am
z1  ...  zk  ...  zn

Unger’s method tries to find a partition of the input that fits this demand. This is a recursive problem: if a non-terminal Ai is to derive a certain part of the input, there must be a partition of this part that fits a right-hand side of Ai . Ultimately, such a right-hand side must consist of terminal symbols only, and these can easily be matched with the current part of the input.

The CYK method approaches the problem the other way around: it tries to find occurrences of right-hand sides in the input; whenever it finds one, it makes a note that the corresponding left-hand side derives this part of the input. Replacing the occurrence of the right-hand side with the corresponding left-hand side results in some sentential forms that derive the input. These sentential forms are again the subject of a search for right-hand sides, etc. Ultimately, we may find a sentential form that both derives the input sentence and is a right-hand side of the start symbol.

In the next two sections, these methods are investigated in detail.


4.1 UNGER’S PARSING METHOD

The input to Unger’s parsing method [CF 1968] consists of a CF grammar and an input sentence. We will first discuss Unger’s parsing method for grammars without ε-rules and without loops (see Section 2.8.4). Then, the problems introduced by ε-rules will be discussed, and the parsing method will be modified to allow for all CF grammars.

4.1.1 Unger’s method without ε-rules or loops

To see how Unger’s method solves the parsing problem, let us consider a small example. Suppose we have a grammar rule

S -> ABC | DE | F

and we want to find out whether S derives the input sentence pqrs. The initial parsing problem can then be schematically represented as:

S

pqrs

For each right-hand side we must first generate all possible partitions of the input sentence. Generating partitions is not difficult: if we have m cups, numbered from 1 to m, and n marbles, numbered from 1 to n, we have to find all possible partitions such that each cup contains at least one marble, the marbles in any cup are consecutively numbered, and no cup contains a marble with a lower number than any marble in a lower-numbered cup. We proceed as follows: first, we put marble 1 in cup 1, and then generate all partitions of the other n−1 marbles over the other m−1 cups. This gives us all partitions that have marble 1 in the first cup. Next, we put marbles 1 and 2 in the first cup, and then generate all partitions of the other n−2 marbles over the other m−1 cups, etc. If n is less than m, no partition is possible.
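The cups-and-marbles procedure above translates directly into a recursive generator. The following sketch (our own Python rendering, not from the book) yields all ways to split a string into m non-empty consecutive parts:

```python
def partitions(s, m):
    """Yield all splits of s into m non-empty consecutive parts."""
    if m == 1:
        if s:                      # the last cup gets all remaining marbles
            yield [s]
        return
    # put 1, 2, ... marbles in the first cup, recurse on the rest
    for k in range(1, len(s) - m + 2):
        for rest in partitions(s[k:], m - 1):
            yield [s[:k]] + rest

print(list(partitions("pqrs", 3)))
# → [['p', 'q', 'rs'], ['p', 'qr', 's'], ['pq', 'r', 's']]
```

Note that when n is less than m the range is empty and no partition is produced, exactly as required.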

Partitioning the input corresponds to partitioning the marbles (the input symbols) over the cups (the right-hand side symbols). If a right-hand side has more symbols than the sentence, no partition can be found (there being no ε-rules). For the first right-hand side the following partitions must be tried:

 

 

 

 

 

 

       S
A      B      C
p      q      rs
p      qr     s
pq     r      s

 

The first partition results in the following sub-problems: does A derive p, does B derive q, and does C derive rs? These sub-problems must all be answered in the affirmative, or the partition is not the right one.

For the second right-hand side, we obtain the following partitions:


 

 

 

 

 

 

 

       S
D      E
p      qrs
pq     rs
pqr    s

 

 

The last right-hand side results in the following partition:

S

F

pqrs

All these sub-problems deal with shorter sentences, except the last one. They will all lead to similar split-ups, and in the end many will fail because a terminal symbol in a right-hand side does not match the corresponding part of the partition. The only partition that causes some concern is the last one. It is as complicated as the one we started with. This is the reason that we have disallowed loops in the grammar. If the grammar has loops, we may get the original problem back again and again. For instance, if there is a rule F -> S in the example above, this will certainly happen.

The above demonstrates that we have a search problem here, and we can attack it with either the depth-first or the breadth-first search technique (see Section 3.6.2). Unger uses depth-first search.
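As a sketch of how such a depth-first search might look, the following minimal recognizer (our own Python illustration, not Unger's original formulation) tries every right-hand side and every partition recursively, for grammars without ε-rules or loops. The dict-of-alternatives grammar encoding is an assumption:

```python
def partitions(s, m):
    """Yield all splits of s into m non-empty consecutive parts."""
    if m == 1:
        if s:
            yield [s]
        return
    for k in range(1, len(s) - m + 2):
        for rest in partitions(s[k:], m - 1):
            yield [s[:k]] + rest

def derives(grammar, symbol, s):
    """Does `symbol` derive the string s? Depth-first, as Unger does."""
    if symbol not in grammar:               # terminal: must match exactly
        return symbol == s
    for rhs in grammar[symbol]:             # try every right-hand side
        for parts in partitions(s, len(rhs)):
            if all(derives(grammar, sym, part)
                   for sym, part in zip(rhs, parts)):
                return True
    return False

# The S -> ABC | DE | F example, padded with hypothetical rules for A..F.
grammar = {"S": [["A", "B", "C"], ["D", "E"], ["F"]],
           "A": [["p"]], "B": [["q"]], "C": [["r", "s"]],
           "D": [["p", "q"]], "E": [["r"]], "F": [["s"]]}
print(derives(grammar, "S", "pqrs"))   # → True
```

A sub-problem is answered affirmatively only when some partition makes all its parts succeed, mirroring the discussion above.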

In the following discussion, the grammar of Figure 4.1 will serve as an example.

ExprS  -> Expr + Term | Term
Term   -> Term × Factor | Factor
Factor -> ( Expr ) | i

Figure 4.1 A grammar describing simple arithmetic expressions

This grammar represents the language of simple arithmetic expressions, with operators + and ×, and operand i. We will use the sentence (i+i)×i as input example. So, the initial problem can be represented as:

Expr

(i+i)×i

Fitting the first right-hand side of Expr with the input results in the following partitions:


 

 

 

 

 

 

 

 

 

 

          Expr
Expr      +       Term
(         i       +i)×i
(         i+      i)×i
(         i+i     )×i
(         i+i)    ×i
(         i+i)×   i
(i        +       i)×i
(i        +i      )×i
(i        +i)     ×i
(i        +i)×    i
(i+       i       )×i
(i+       i)      ×i
(i+       i)×     i
(i+i      )       ×i
(i+i      )×      i
(i+i)     ×       i

 

 

Even a small example like this already results in 15 partitions, and we will not examine them all here, although the unoptimized version of the algorithm requires this. We will only examine the partitions that have at least some chance of succeeding: we can eliminate all partitions that do not match the terminal symbol of the right-hand side. So, the only partition worth investigating further is:

 

 

 

 

       Expr
Expr    +    Term
(i      +    i)×i

The first sub-problem here is to find out whether and, if so, how Expr derives (i. We cannot partition (i into three non-empty parts because it only consists of 2 symbols. Therefore, the only rule that we can apply is the rule Expr -> Term. Similarly, the only rule that we can apply next is the rule Term -> Factor. So, we now have

Expr

Term

Factor

(i

However, this is impossible, because the first right-hand side of Factor has too many symbols, and the second one consists of one terminal symbol only. Therefore, the partition we started with does not fit, and it must be rejected. The other partitions were already rejected, so we can conclude that the rule Expr -> Expr + Term does not derive the input.

The second right-hand side of Expr consists of only one symbol, so we only have one partition here, consisting of one part. Partitioning this part for the first right-hand side of Term again results in 15 possibilities, of which again only one has a chance of


succeeding:

 

 

 

 

 

 

 

 

 

 

 

 

         Expr
         Term
Term      ×      Factor
(i+i)     ×      i

 

Continuing our search, we will find the following derivation:

Expr -> Term -> Term × Factor -> Factor × Factor ->
( Expr ) × Factor -> ( Expr + Term ) × Factor ->
( Term + Term ) × Factor -> ( Factor + Term ) × Factor ->
( i + Term ) × Factor -> ( i + Factor ) × Factor ->
( i + i ) × Factor -> ( i + i ) × i

and this is the only derivation to be found.

This example demonstrates several aspects of the method: even small examples require a considerable amount of work, but even some simple checks can result in huge savings. For instance, matching the terminal symbols in a right-hand side with the partition at hand often leads to the rejection of the partition without investigating it any further. Unger [CF 1968] presents several more of these checks. For instance, one can compute the minimum length of strings of terminal symbols derivable from each nonterminal. Once it is known that a certain non-terminal only derives terminal strings of length at least n, all partitions that fit this non-terminal with a substring of length less than n can be immediately rejected.
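The minimum-length check mentioned above can be pre-computed with a simple fixed-point iteration. The following sketch uses the same hypothetical dict-of-alternatives grammar encoding as before:

```python
import math

def min_lengths(grammar):
    """For each non-terminal, the minimum length of a terminal string
    derivable from it (assumes no ε-rules, so every length is >= 1)."""
    minlen = {nt: math.inf for nt in grammar}
    changed = True
    while changed:                       # iterate until nothing improves
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                # a terminal contributes 1, a non-terminal its current bound
                n = sum(minlen.get(sym, 1) for sym in rhs)
                if n < minlen[nt]:
                    minlen[nt] = n
                    changed = True
    return minlen

grammar = {"Expr":   [["Expr", "+", "Term"], ["Term"]],
           "Term":   [["Term", "×", "Factor"], ["Factor"]],
           "Factor": [["(", "Expr", ")"], ["i"]]}
print(min_lengths(grammar))   # → {'Expr': 1, 'Term': 1, 'Factor': 1}
```

A partition that fits a non-terminal with a substring shorter than its computed minimum can then be rejected without further work.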

4.1.2 Unger’s method with ε-rules

So far, we have only dealt with grammars without ε-rules, and not without reason. Complications arise when the grammar contains ε-rules, as the following example demonstrates: consider the grammar rule S -> ABC and the input sentence pqr. If we want to examine whether this rule derives the input sentence, and we allow for ε-rules, many more partitions have to be investigated, because each of the non-terminals A, B, and C may derive the empty string. In this case, generating all partitions proceeds just as above, except that we first generate the partitions that have no marble at all in the first cup, then the partitions that have marble 1 in the first cup, etc.:


 

 

 

 

 

 

 

 

 

       S
A      B      C
-      -      pqr
-      p      qr
-      pq     r
-      pqr    -
p      -      qr
p      q      r
p      qr     -
pq     -      r
pq     r      -
pqr    -      -

 

 

 

 

 

 

 

 

 

Now suppose that we are investigating whether B derives pqr, and suppose there is a rule B -> SD. Then, we will have to investigate the following partitions:

 

 

 

 

       B
S      D
-      pqr
p      qr
pq     r
pqr    -

 

 

 

It is the last of these partitions that will cause trouble: in the process of finding out whether S derives pqr, we end up asking the same question again, in a different context. If we are not careful and do not detect this, our parser will loop forever, or run out of memory.

When searching along this path, we are looking for a derivation that uses a loop in the grammar; this may happen even if the grammar contains no loops. If the loop actually exists in the grammar, there are infinitely many derivations to be found along this path, provided that there is at least one, so we will never be able to present them all; the only interesting derivations are the ones without the loop. If, on the other hand, the grammar does not contain such a loop, the search along this path is doomed to fail anyway. So, in either case we can safely cut off the search. Fortunately, this is not a difficult task. All we have to do is to maintain a list of questions that we are currently investigating. Before starting to investigate a new question (for instance “does S derive pqr?”) we first check whether the question already appears in the list. If it does, we do not investigate it; instead, we proceed as if the question were answered negatively.
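The list of questions under investigation can be sketched as follows. This is our own illustration of the cut-off, with the same assumed grammar encoding as before; an empty right-hand side represents an ε-rule, and partitions may now have empty parts:

```python
def partitions(s, m):
    """All splits of s into m consecutive, possibly empty parts."""
    if m == 1:
        yield [s]
        return
    for k in range(len(s) + 1):
        for rest in partitions(s[k:], m - 1):
            yield [s[:k]] + rest

def derives(grammar, symbol, s, pending=frozenset()):
    """Does `symbol` derive s?  `pending` holds the questions currently
    under investigation; a repeated question is answered negatively."""
    if symbol not in grammar:                 # terminal symbol
        return symbol == s
    if (symbol, s) in pending:                # cut-off
        return False
    pending = pending | {(symbol, s)}
    for rhs in grammar[symbol]:
        if not rhs:                           # ε-rule
            if s == "":
                return True
            continue
        for parts in partitions(s, len(rhs)):
            if all(derives(grammar, sym, part, pending)
                   for sym, part in zip(rhs, parts)):
                return True
    return False

grammar = {"S": [["L", "S", "D"], []],        # S -> LSD | ε
           "L": [[]],                         # L -> ε
           "D": [["d"]]}                      # D -> d
print(derives(grammar, "S", "dd"))            # → True
```

Without the `pending` check, the question “does S derive dd?” would reproduce itself through the partition that gives all of dd to the embedded S, and the recursion would never terminate.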

Consider for instance the following grammar:

S -> LSD | ε
L -> ε
D -> d

This grammar generates sequences of d’s in an awkward way. The complete search for


the questions S →* d? and S →* dd? is depicted in Figure 4.2.

[Figure 4.2 in outline: the question S →* d? is split over the right-hand sides LSD and ε. The ε alternative fails (ε → d? no). For LSD, the partition - - d succeeds: L →* ε? yes (ε →* ε), S →* ε? yes (its own LSD alternative is cut off; its ε alternative answers yes), and D →* d? yes (d →* d); the partitions that give d to L fail (ε → d? no), and the repeated question S →* d? is cut off. The question S →* dd? succeeds analogously through the partition - d d, reusing S →* d? (see above: yes) and D →* d? yes; the attempts D →* dd? (d →* dd? no), L →* d? and L →* dd? (ε → d?, ε → dd? no), and the repeated question S →* dd? (cut-off: no) all fail.]

Figure 4.2 Unger’s parser at work for the sentences d and dd

Figure 4.2 must be read from left to right, and from top to bottom. The questions are drawn in an ellipse, with the split-ups over the right-hand sides in boxes. A question is


answered affirmatively if at least one of the boxes results in a “yes”. In contrast, a partition only results in an affirmative answer if all questions arising from it result in a “yes”.

Checking for cut-offs is easy: if a new question is asked, we follow the arrows in the reversed direction (to the left). This way, we traverse the list of currently investigated questions. If we meet the question again, we have to cut off the search.

To find the parsings, every question that is answered affirmatively has to pass back a list of rules that start the derivation asked for in the question. This list can be placed into the ellipse, together with the question. We have not done so in Figure 4.2, because it is complicated enough as it is. However, if we strip Figure 4.2 of its dead ends, and leave out the boxes, we get Figure 4.3.

 

S →* dd? yes  (S -> LSD)
    L →* ε? yes  (L -> ε)
    S →* d? yes  (S -> LSD)
        L →* ε? yes  (L -> ε)
        S →* ε? yes  (S -> ε)
        D →* d? yes  (D -> d)
    D →* d? yes  (D -> d)

Figure 4.3 The result of Unger’s parser for the sentence dd

In this case, every ellipse only has one possible grammar rule. Therefore, there is only one parsing, and we obtain it by reading Figure 4.3 from left to right, top to bottom:

S -> LSD -> SD -> LSDD -> SDD -> DD -> dD -> dd.

In general, the total number of parsings is equal to the product of the number of grammar rules in each ellipse.

This example shows that we can save much time by remembering answers to questions. For instance, the question whether L derives ε is asked many times. Sheil [CF 1976] has shown that the efficiency improves dramatically when this is done: it goes from exponential to polynomial. Another possible optimization is achieved by computing in advance which non-terminals can derive ε. In fact, this is a special case of computing the minimum length of a terminal string that each non-terminal derives. If a non-terminal derives ε, this minimum length is 0.
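Computing in advance which non-terminals can derive ε is again a fixed-point (closure) computation. A sketch under the same assumed grammar encoding, where an empty right-hand side represents an ε-rule:

```python
def nullable(grammar):
    """The set of non-terminals that can derive the empty string."""
    nullable_set = set()
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            if nt in nullable_set:
                continue
            for rhs in alternatives:
                # a right-hand side derives ε iff all its symbols do
                # (an ε-rule has an empty rhs, so all() is trivially true)
                if all(sym in nullable_set for sym in rhs):
                    nullable_set.add(nt)
                    changed = True
                    break
    return nullable_set

grammar = {"S": [["L", "S", "D"], []],
           "L": [[]],
           "D": [["d"]]}
print(sorted(nullable(grammar)))   # → ['L', 'S']
```

Terminal symbols are never in the set, so any right-hand side containing a terminal is correctly excluded.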

4.2 THE CYK PARSING METHOD

The parsing method described in this section is attributed to J. Cocke, D.H. Younger, and T. Kasami, who, independently, discovered variations of the method; it is now known as the Cocke-Younger-Kasami method, or the CYK method. The most accessible original description is that of Younger [CF 1967]. A much earlier description is by Sakai [CF 1962].

As with Unger’s parsing method, the input to the CYK algorithm consists of a CF


grammar and an input sentence. The first phase of the algorithm constructs a table telling us which non-terminal(s) derive which substrings of the sentence. This is the recognition phase. It ultimately also tells us whether the input sentence can be derived from the grammar. The second phase uses this table and the grammar to construct all possible derivations of the sentence.

We will first concentrate on the recognition phase, which really is the distinctive feature of the algorithm.

4.2.1 CYK recognition with general CF grammars

To see how the CYK algorithm solves the recognition and parsing problem, let us consider the grammar of Figure 4.4. This grammar describes the syntax of numbers in scientific notation. An example sentence produced by this grammar is 32.5e+1. We will now use this grammar and sentence as an example.

NumberS  -> Integer | Real
Integer  -> Digit | Integer Digit
Real     -> Integer Fraction Scale
Fraction -> . Integer
Scale    -> e Sign Integer | Empty
Digit    -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Empty    -> ε
Sign     -> + | -

Figure 4.4 A grammar describing numbers in scientific notation

The CYK algorithm first concentrates on substrings of the input sentence, shortest substrings first, and then works its way up. The following derivations of substrings of length 1 can be read directly from the grammar:

 

Digit     Digit               Digit               Digit
                                          Sign
3         2          .        5          e  +     1

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

This means that Digit derives 3, Digit derives 2, etc. Note, however, that this picture is not yet complete. For one thing, there are several other non-terminals deriving 3. This complication arises because the grammar contains so-called unit rules, rules of the form A -> B, where A and B are non-terminals. Such rules are also called single rules or chain rules. We can have chains of them in a derivation. So, the next step consists of applying the unit rules, repeatedly, for instance to find out which other non-terminals derive 3. This gives us the following result:

Number,   Number,             Number,             Number,
Integer,  Integer,            Integer,    Sign    Integer,
Digit     Digit               Digit               Digit
3         2          .        5          e  +     1
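The repeated application of unit rules is itself a small closure computation: given the set of non-terminals already known to derive a substring, add every A with a unit rule A -> B for some B in the set. A sketch (our own, using an excerpt of the unit rules of Figure 4.4):

```python
def unit_closure(derivers, unit_rules):
    """Close a set of non-terminals under unit rules.
    unit_rules: list of (A, B) pairs for rules A -> B."""
    result = set(derivers)
    changed = True
    while changed:
        changed = False
        for a, b in unit_rules:
            if b in result and a not in result:
                result.add(a)
                changed = True
    return result

# The unit rules of the Figure 4.4 grammar (an excerpt).
units = [("Number", "Integer"), ("Number", "Real"),
         ("Integer", "Digit")]
print(sorted(unit_closure({"Digit"}, units)))
# → ['Digit', 'Integer', 'Number']
```

This is exactly how Digit deriving 3 leads to Integer and Number deriving 3 in the picture above.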

 

 

 

 

 

 


Now, we already see some combinations that we recognize from the grammar: For instance, an Integer followed by a Digit is again an Integer, and a . (dot) followed by an Integer is a Fraction. We get (again also using unit rules):

 

Number, Integer (32)      Fraction (.5)      Scale (e+1)

Number,   Number,             Number,             Number,
Integer,  Integer,            Integer,    Sign    Integer,
Digit     Digit               Digit               Digit
3         2          .        5          e  +     1

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

At this point, we see that the Real-rule is applicable in several ways, and then the Number-rule, so we get:

Number, Real (32.5e+1)
Number, Real (32.5)

Number, Integer (32)      Fraction (.5)      Scale (e+1)

Number,   Number,             Number,             Number,
Integer,  Integer,            Integer,    Sign    Integer,
Digit     Digit               Digit               Digit
3         2          .        5          e  +     1

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

We find that Number does indeed derive 32.5e+1.
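The recognition phase walked through above can be sketched in code. This is our own simplified illustration, not the full CYK algorithm of this section: it assumes a grammar without ε-rules, so Scale and Empty are left out and Real is shortened to the hypothetical rule Real -> Integer Fraction. A table maps each substring (start position, length) to the set of non-terminals deriving it, filled shortest substrings first, with an inner loop that closes unit-rule chains:

```python
def cyk_table(grammar, s):
    """(start, length) -> set of non-terminals deriving that substring."""
    n = len(s)
    table = {(i, l): set() for i in range(n) for l in range(1, n - i + 1)}

    def splits(i, l, rhs):
        # spread s[i:i+l] over the symbols of rhs, non-empty pieces only
        if len(rhs) == 1:
            yield [(i, l, rhs[0])]
            return
        for k in range(1, l - len(rhs) + 2):
            for rest in splits(i + k, l - k, rhs[1:]):
                yield [(i, k, rhs[0])] + rest

    def piece_ok(j, m, sym):
        if sym in grammar:                    # non-terminal: consult table
            return sym in table[(j, m)]
        return s[j:j + m] == sym              # terminal: match directly

    for l in range(1, n + 1):                 # shortest substrings first
        for i in range(n - l + 1):
            changed = True
            while changed:                    # close unit-rule chains
                changed = False
                for nt, alternatives in grammar.items():
                    if nt in table[(i, l)]:
                        continue
                    if any(all(piece_ok(j, m, sym) for j, m, sym in parts)
                           for rhs in alternatives
                           for parts in splits(i, l, rhs)):
                        table[(i, l)].add(nt)
                        changed = True
    return table

# Simplified excerpt of the Figure 4.4 grammar, without the ε-rules.
grammar = {"Number":   [["Integer"], ["Real"]],
           "Integer":  [["Digit"], ["Integer", "Digit"]],
           "Real":     [["Integer", "Fraction"]],
           "Fraction": [[".", "Integer"]],
           "Digit":    [[str(d)] for d in range(10)]}
table = cyk_table(grammar, "32.5")
print(sorted(table[(0, 4)]))                  # → ['Number', 'Real']
```

The entry for the whole sentence contains the start symbol, so 32.5 is recognized; how ε-rules are accommodated is the subject of the discussion that follows.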

In the example above, we have seen that unit rules complicate things a bit. Another complication, one that we have avoided until now, is formed by ε-rules. For instance, if we want to recognize the input 43.1 according to the example grammar, we have to realize that Scale derives ε here, so we get the following picture:

Number, Real (43.1, with Scale deriving ε at the end)

Number, Integer (43)      Fraction (.1)      Scale (ε)

Number,   Number,             Number,
Integer,  Integer,            Integer,
Digit     Digit               Digit
4         3          .        1

 

 

 

 

 

 

 

 

 

 

 

In general this is even more complicated. We must take into account the fact that several non-terminals can derive ε between any two adjacent terminal symbols in the input sentence, and also in front of the input sentence or at the back. However, as we shall see, the problems caused by these kinds of rules can be solved, albeit at a certain cost.

In the meantime, we will not let these problems discourage us. In the example,
