
Consistently taking the left-most one results in a left-most derivation; consistently taking the right-most one results in a right-most derivation.

Notice that we can use an Unger-style parser for this. However, it would no longer have to generate all partitions, because we already know which partitions will work.
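To make this concrete, here is a minimal sketch in Python of how the sets R_{s_{i,l}} of Figure 4.14 can be computed; the transcription of the CNF grammar of Figure 4.13 into the two dictionaries below is our own assumption, not the book's code.

# A minimal CYK recognizer (a sketch, not the book's code). R[i, l]
# is the set of non-terminals deriving the substring of length l that
# starts at the (1-based) position i -- the book's R_{s_{i,l}}.

unit = {                       # terminal rules A -> a, transcribed
    'Number':  set('0123456789'),
    'Integer': set('0123456789'),
    'Digit':   set('0123456789'),
    'T1':      {'.'},
    'T2':      {'e'},
    'Sign':    {'+', '-'},
}
binary = [                     # rules A -> B C of Figure 4.13
    ('Number', 'N1', "Scale'"),
    ('Number', 'Integer', 'Digit'),
    ('Number', 'Integer', 'Fraction'),
    ('N1', 'Integer', 'Fraction'),
    ('Integer', 'Integer', 'Digit'),
    ('Fraction', 'T1', 'Integer'),
    ("Scale'", 'N2', 'Integer'),
    ('N2', 'T2', 'Sign'),
]

def cyk(s):
    n = len(s)
    R = {}
    for i in range(1, n + 1):              # substrings of length 1
        R[i, 1] = {A for A, ts in unit.items() if s[i - 1] in ts}
    for l in range(2, n + 1):              # then longer and longer ones
        for i in range(1, n - l + 2):
            R[i, l] = {A for A, B, C in binary
                         for k in range(1, l)
                         if B in R[i, k] and C in R[i + k, l - k]}
    return R

R = cyk('32.5e+1')
print(R[1, 7])    # the set {Number}: the whole sentence is a Number
print(R[1, 4])    # the set {Number, N1}: 32.5
print(R[5, 3])    # the set {Scale'}: e+1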

Let us try to find a left-most derivation for the example sentence and grammar, using the recognition table of Figure 4.14. We begin with the start symbol, Number. Our sentence contains seven symbols, which is certainly more than one, so we have to use one of the rules with a right-hand side of the form AB. The Integer Digit rule is not applicable here, because the only instance of Digit that could lead to a derivation of the sentence is the one in R_{s_{7,1}}, but Integer is not a member of R_{s_{1,6}}. The Integer Fraction rule is not applicable either, because there is no Fraction deriving the last part of the sentence. This leaves us with the production rule Number -> N1 Scale', which is indeed applicable, because N1 is a member of R_{s_{1,4}} and Scale' is a member of R_{s_{5,3}}, so N1 derives 32.5 and Scale' derives e+1.

Next, we have to find out how N1 derives 32.5. There is only one applicable rule: N1 -> Integer Fraction, and it is indeed applicable, because Integer is a member of R_{s_{1,2}} and Fraction is a member of R_{s_{3,2}}, so Integer derives 32 and Fraction derives .5. In the end, we find the following derivation:

Number -> N1 Scale' ->
Integer Fraction Scale' ->
Integer Digit Fraction Scale' ->
3 Digit Fraction Scale' ->
3 2 Fraction Scale' ->
3 2 T1 Integer Scale' ->
3 2 . Integer Scale' ->
3 2 . 5 Scale' ->
3 2 . 5 N2 Integer ->
3 2 . 5 T2 Sign Integer ->
3 2 . 5 e Sign Integer ->
3 2 . 5 e + Integer ->
3 2 . 5 e + 1

Unfortunately, this is not exactly what we want, because this is a derivation that uses the rules of the grammar of Figure 4.13, not the rules of the grammar of Figure 4.4, the one that we started with.

4.2.6 Undoing the effect of the CNF transformation

When we examine the grammar of Figure 4.4 and the recognition table of Figure 4.14, we see that the recognition table contains the information we need on most of the non-terminals of the original grammar. However, there are a few non-terminals missing in the recognition table: Scale, Real, and Empty. Scale and Empty were removed because they became non-reachable after the elimination of ε-rules. Empty was removed altogether, because it only derived the empty string, and Scale was replaced by Scale', where Scale' derives exactly the same as Scale, except for the empty string. We can use this to add some more information to the recognition table: at every occurrence of Scale', we add Scale.

The non-terminal Real was removed because it became non-reachable after eliminating the unit rules. Now, the CYK algorithm does not require that all non-terminals in the grammar be reachable. We could just as well have left the non-terminal Real in the grammar, and have transformed its rules to CNF. The CYK algorithm would then have added Real to the recognition table, wherever that would be appropriate. The rules for Real that would be added to the grammar of Figure 4.13 are:

Real -> N1 Scale' | Integer Fraction

The resulting recognition table is presented in Figure 4.15. In this figure, we have also added an extra row at the bottom of the triangle. This extra row represents the non-terminals that derive the empty string. These non-terminals can be considered as possibly occurring between any two adjacent symbols in the sentence, and also in front of or at the end of the sentence. The set R_{s_{i,0}} represents the non-terminals that can be considered as possibly occurring just in front of symbol z_i, and the set R_{s_{n+1,0}} represents the ones that can occur at the end of the sentence.

 l
 7 Number,
   Real
 6           Number,
             Real
 5
 4 Number,
   Real,
   N1
 3           Number,                       Scale',
             Real,                         Scale
             N1
 2 Number,             Fraction            N2
   Integer
 1 Number,   Number,   T1        Number,   T2        Sign      Number,
   Integer,  Integer,            Integer,                      Integer,
   Digit     Digit               Digit                         Digit
 0 Scale,    Scale,    Scale,    Scale,    Scale,    Scale,    Scale,    Scale,
   Empty     Empty     Empty     Empty     Empty     Empty     Empty     Empty
      3         2         .         5         e         +         1
 i =  1         2         3         4         5         6         7         8

Figure 4.15 The recognition table with Scale, Real, and Empty added

Now we have a recognition table which contains all the information we need to parse a sentence with the original grammar. Again, a derivation starts with the start symbol S.

If A_1 A_2 ... A_m is a right-hand side of S, we want to know if this rule can be applied, that is, if A_1 A_2 ... A_m derives s_{1,n}. This is checked starting with A_1. There are two cases:

- A_1 is a terminal symbol. In this case, it must be the first symbol of s_{1,n}, or this rule is not applicable. Then, we must check if A_2 ... A_m derives s_{2,n-1}, in the same way that we are now checking if A_1 A_2 ... A_m derives s_{1,n}.

- A_1 is a non-terminal. In this case, it must be a member of R_{s_{1,k}} for some k, or this rule is not applicable. Then, we must check if A_2 ... A_m derives s_{k+1,n-k}, in the same way that we are now checking if A_1 A_2 ... A_m derives s_{1,n}. If we want all parsings, we must do this for every k for which A_1 is a member of R_{s_{1,k}}. Notice that non-terminals deriving the empty string pose no problem at all, because they appear as a member of R_{s_{i,0}} for all i.

We have now determined whether the rule is applicable, and if it is, which parts of the rule derive which substrings. The next step now is to determine how the substrings can be derived. These tasks are similar to the task we started with, and are solved in the same way. This process will terminate at some time, provided the grammar does not contain loops. This is simply an Unger parser that knows in advance which partitions will lead to a successful parse.
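As an illustration, the sketch below continues the earlier Python sketch (the cyk(), unit and binary definitions): it first extends the table with Real, Scale and the R_{s_{i,0}} row, as described above, and then walks the original grammar of Figure 4.4, trying for each right-hand side only those cuts the table allows. The encodings are our own assumptions, not the book's notation.

# Step 1: extend the CNF grammar and the table as Section 4.2.6 does.
binary += [('Real', 'N1', "Scale'"),        # the re-added Real rules
           ('Real', 'Integer', 'Fraction')]
s = '32.5e+1'
R = cyk(s)
for nts in R.values():                      # every Scale' is also a Scale
    if "Scale'" in nts:
        nts.add('Scale')
for i in range(1, len(s) + 2):              # the extra row R_{s_{i,0}}
    R[i, 0] = {'Scale', 'Empty'}

# Step 2: the grammar of Figure 4.4, transcribed by assumption.
grammar = {
    'Number':   [['Integer'], ['Real']],
    'Integer':  [['Digit'], ['Integer', 'Digit']],
    'Real':     [['Integer', 'Fraction', 'Scale']],
    'Fraction': [['.', 'Integer']],
    'Scale':    [['e', 'Sign', 'Integer'], ['Empty']],
    'Empty':    [[]],
    'Sign':     [['+'], ['-']],
    'Digit':    [[d] for d in '0123456789'],
}

def parse(A, i, l):
    """One parse tree for non-terminal A deriving s_{i,l}, or None."""
    if A not in R.get((i, l), set()):
        return None
    for rhs in grammar[A]:
        kids = match(rhs, i, l)
        if kids is not None:
            return (A, kids)
    return None

def match(rhs, i, l):
    """Match rhs against s_{i,l}; the table tells us which cuts to try."""
    if not rhs:
        return [] if l == 0 else None
    X, rest = rhs[0], rhs[1:]
    if X not in grammar:                    # X is a terminal symbol
        if l >= 1 and s[i - 1] == X:
            tail = match(rest, i + 1, l - 1)
            if tail is not None:
                return [X] + tail
        return None
    for k in range(0, l + 1):               # X is a non-terminal
        if X in R.get((i, k), set()):       # only cuts the table allows
            sub = parse(X, i, k)
            tail = match(rest, i + k, l - k) if sub else None
            if tail is not None:
                return [sub] + tail
    return None

print(parse('Number', 1, len(s)))           # the unique parse tree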

Let us go back to the grammar of Figure 4.4 and the recognition table of Figure 4.15, and see how this works for our example input sentence. We now know that Number does derive 32.5e+1, and want to know how. We first ask ourselves: can we use the Number -> Integer rule? Integer is a member of R_{s_{1,1}} and R_{s_{1,2}}, but there is nothing behind the Integer in the rule to derive the rest of the sentence, so we cannot use this rule. Can we use the Number -> Real rule? Yes we can, because Real is a member of R_{s_{1,7}}, and the length of the sentence is 7. So, we start our derivation with

Number -> Real -> ...

Now, we get similar questions for the Real non-terminal: can we use the Real -> Integer Fraction Scale rule? Well, Integer is a member of R_{s_{1,1}}, but we cannot find a Fraction in any of the R_{s_{2,k}} sets. However, Integer is also a member of R_{s_{1,2}}, and Fraction is a member of R_{s_{3,2}}. Now, Scale is a member of R_{s_{5,0}}; this does not help, because it would leave nothing in the rule to derive the rest. Fortunately, Scale is also a member of R_{s_{5,3}}, and that matches exactly to the end of the string. So, this rule is indeed applicable, and we continue our derivation:

Number -> Real -> Integer Fraction Scale -> ...

The sentence is now split up into three parts:

              Number
                |
              Real
        ________|________
       /        |        \
   Integer   Fraction   Scale

    3 2        . 5      e + 1

It is left to the reader to verify that we will find only one derivation, and that this is it:

Number ->
Real ->
Integer Fraction Scale ->
Integer Digit Fraction Scale ->
Digit Digit Fraction Scale ->
3 Digit Fraction Scale ->
3 2 Fraction Scale ->
3 2 . Integer Scale ->
3 2 . Digit Scale ->
3 2 . 5 Scale ->
3 2 . 5 e Sign Integer ->
3 2 . 5 e + Integer ->
3 2 . 5 e + Digit ->
3 2 . 5 e + 1

4.2.7 A short retrospective of CYK

We have come a long way. We started with building a recognition table using the original grammar. Then we found that using the original grammar with its unit rules and ε-rules is somewhat complicated, although it can certainly be done. We proceeded by transforming the grammar to CNF. CNF does not contain unit rules or ε-rules; our gain in this respect was that the algorithm for constructing the recognition table became much simpler. The limitation of the maximum length of a right-hand side to 2 was a gain in efficiency, and also a little in simplicity. However, Sheil [CF 1976] has demonstrated that the efficiency only depends on the maximum number of non-terminals occurring in a right-hand side of the grammar, not on the length of the right-hand sides. This can easily be understood, once one realizes that the efficiency depends (among other things) on the number of cuts in a substring that are “difficult” to find, when checking whether a right-hand side derives this substring. This number of “difficult” cuts only depends on the number of non-terminals in the right-hand side. So, for efficiency, CNF is a bit too restrictive.

A disadvantage of this transformation to CNF is that the resulting recognition table lacks some information that we need to construct a derivation using the original grammar. In the transformation process, some non-terminals were thrown away because they became non-reachable. Fortunately, the missing information could easily be added. Ultimately, this process resulted in almost the same recognition table that we would get with our first attempt using the original grammar. It only contains some extra information on non-terminals that were added during the transformation of the grammar to CNF. More importantly, however, it was obtained in a simpler and much more efficient way.


4.2.8 Chart parsing

The CYK algorithm is also known under the name of chart parsing. More precisely, both techniques have a number of variants and some variants of the CYK algorithm are identical to some variants of chart parsing. The most striking difference between them lies in the implementation; conceptually both algorithms do the same thing: they collect possible parsings for larger and larger chunks of the input.

Although often presented in a different format, a chart is just a recognition table. Figure 4.16 shows the recognition table of Figure 4.14 in a chart format: each arc represents a non-terminal deriving the part of the sentence spanned by the arc.

 

 

 

[Figure 4.16 is not reproduced here. It shows the sentence 3 2 . 5 e + 1 with an arc drawn over tokens i..i+l-1 for every non-terminal in R_{s_{i,l}}: Number, Integer and Digit over each of the digits, T1 over ., T2 over e, Sign over +; Integer over 32; Fraction over .5; N2 over e+; N1 (and Number) over 32.5 and 2.5; Scale' over e+1; and Number over 32, over 2.5e+1 and over the whole sentence.]

Figure 4.16 The recognition table of Figure 4.14 in chart format
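In code the correspondence is immediate; a small sketch, reusing the table produced by the cyk() sketch of Section 4.2 (our own code, not the book's):

# Every entry of the recognition table becomes an arc of the chart:
# non-terminal A in R[i, l] is an arc spanning tokens i .. i+l-1.
def arcs(R):
    for (i, l), nts in sorted(R.items()):
        if l > 0:                     # skip the empty-string row, if any
            for A in sorted(nts):
                yield A, i, i + l - 1

for A, lo, hi in arcs(cyk('32.5e+1')):
    print(f'{A:8} spans tokens {lo}..{hi}')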

Several variants of chart parsing are discussed and compared in Bolc [NatLang 1987].

5 Regular grammars and finite-state automata

Regular grammars are the simplest form of grammars that still have generative power. They can describe concatenation (joining two texts together) and repetition and can specify alternatives, but they cannot express nesting. Regular grammars are probably the best-understood part of formal linguistics and almost all questions about them can be answered.

5.1 APPLICATIONS OF REGULAR GRAMMARS

In spite of their simplicity there are many applications of regular grammars, of which we will briefly mention the most important ones.

5.1.1 CF parsing

In some parsers for CF grammars, a subparser can be discerned that handles a regular grammar; such a subparser is based, implicitly or explicitly, on the following surprising phenomenon. Consider the sentential forms in left-most or right-most derivations. Such a sentential form consists of a closed (finished) part, which contains terminal symbols only, and an open (unfinished) part, which contains non-terminals as well. In left-most derivations the open part starts at the left-most non-terminal and extends to the right; in right-most derivations the open part starts at the right-most non-terminal and extends to the left; see Figure 5.1, which uses sentential forms from Section 2.5.2.

[Figure 5.1 is not reproduced here. It shows the sentential form d , N & N with its open part N & N extending to the right from the left-most non-terminal, and the sentential form N , N & h with its open part N , N extending to the left from the right-most non-terminal.]

Figure 5.1 Open parts in left-most and right-most productions

Now it can be proved (and it is not difficult to show) that these open parts can be described by a regular grammar (which follows from the CF grammar). Furthermore, these open parts of the sentential form play an important role in some CF parsing methods which explains the significance of regular grammars for CF parsing.


5.1.2 Systems with finite memory

Since CF (or stronger) grammars allow nesting and since nesting can, in principle, be arbitrarily deep, the generation of correct CF (or stronger) sentences may, in principle, require an arbitrary amount of memory to temporarily hold the unprocessed nesting information. Mechanical systems do not possess an arbitrary amount of memory and consequently cannot exhibit CF behaviour and are restricted to regular behaviour. This is immediately clear for simple mechanical systems like vending machines, traffic lights and video-recorders: they all behave according to a regular grammar. It is also in principle true for more complicated mechanical systems, like a country’s train system or a computer. Here, the argument gets, however, rather vacuous since nesting information can be represented very efficiently and a little memory can take care of a lot of nesting. Consequently, although these systems in principle exhibit regular behaviour, it is often easier to describe them with CF or stronger means, even though that would incorrectly ascribe infinite memory to them.

Conversely, the global behaviour of many systems that do have much memory can still be described by a regular grammar, and many CF grammars are already for a large part regular. This is because regular grammars already take adequate care of concatenation, repetition and choice; context-freeness is only required for nesting. If we call a rule that produces a regular (sub)language (and which consequently could be replaced by a regular rule) “quasi-regular”, we can observe the following. If all alternatives of a rule contain terminals only, that rule is quasi-regular (choice). If all alternatives of a rule contain only terminals and non-terminals the rules of which are quasi-regular and non-recursive, then that rule is quasi-regular (concatenation). And if a rule is recursive but recursion occurs only at the end of an alternative and involves only quasi-regular rules, then that rule is again quasi-regular (repetition). This often covers large parts of a CF grammar. See Krzemień and Łukasiewicz [FS 1976] for an algorithm to identify all quasi-regular rules in a grammar.

Natural languages are a case in point. Although CF or stronger grammars seem necessary to delineate the set of correct sentences (and they may very well be, to catch many subtleties), quite a good rough description can be obtained through regular languages. Consider the stylized grammar for the main clause in a Subject-Verb-Object (SVO) language in Figure 5.2.

MainClause -> Subject Verb Object
Subject    -> [ a | the ] Adjective* Noun
Object     -> [ a | the ] Adjective* Noun
Verb       -> verb1 | verb2 | ...
Adjective  -> adj1 | adj2 | ...
Noun       -> noun1 | noun2 | ...

Figure 5.2 A not obviously quasi-regular grammar

This grammar is quasi-regular: Verb, Adjective and Noun are regular by themselves, Subject and Object are concatenations of repetitions of regular forms (regular non-terminals and choices) and are therefore quasi-regular, and so is MainClause. It takes some work to bring this grammar into standard regular form, but it can be done, as shown in Figure 5.3, in which the lists for verbs, adjectives and nouns have been abbreviated to verb, adjective and noun, to save space.


MainClause -> a SubjAdjNoun_verb_Object
MainClause -> the SubjAdjNoun_verb_Object
SubjAdjNoun_verb_Object -> noun verb_Object
SubjAdjNoun_verb_Object -> adjective SubjAdjNoun_verb_Object
verb_Object -> verb Object
Object -> a ObjAdjNoun
Object -> the ObjAdjNoun
ObjAdjNoun -> noun
ObjAdjNoun -> adjective ObjAdjNoun
verb -> verb1 | verb2 | ...
adjective -> adj1 | adj2 | ...
noun -> noun1 | noun2 | ...

Figure 5.3 A regular grammar in standard form for that of Figure 5.2
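The quasi-regularity test described above can be mechanized. Below is a rough fixpoint sketch in Python; it is our own formulation, detects only direct tail recursion (the algorithm of Krzemień and Łukasiewicz handles the general case), and is applied to a transcription of the grammar of Figure 5.2 in which Adjective* is expanded into a hypothetical auxiliary tail-recursive rule AdjNoun.

# A non-terminal is marked quasi-regular when every alternative of its
# rule consists of terminals, already-marked non-terminals, or a
# direct self-reference in the last position (tail recursion).
def quasi_regular(grammar):
    qr = set()
    while True:
        new = {A for A, alts in grammar.items() if A not in qr and all(
                   all(X not in grammar                      # terminal
                       or X in qr                            # known q-r
                       or (X == A and j == len(alt) - 1)     # tail rec.
                       for j, X in enumerate(alt))
                   for alt in alts)}
        if not new:
            return qr
        qr |= new

grammar = {   # Figure 5.2 with Adjective* expanded (our encoding)
    'MainClause': [['Subject', 'Verb', 'Object']],
    'Subject':    [['a', 'AdjNoun'], ['the', 'AdjNoun']],
    'Object':     [['a', 'AdjNoun'], ['the', 'AdjNoun']],
    'AdjNoun':    [['Noun'], ['Adjective', 'AdjNoun']],
    'Verb':       [['verb1'], ['verb2']],
    'Adjective':  [['adj1'], ['adj2']],
    'Noun':       [['noun1'], ['noun2']],
}
print(quasi_regular(grammar))   # marks all seven non-terminals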

Even (finite) context-dependency can be incorporated: for languages that require the verb to agree in number with the subject, we duplicate the first rule:

MainClause -> SubjectSingular VerbSingular Object
            | SubjectPlural VerbPlural Object

and duplicate the rest of the grammar accordingly. The result is still regular. Nested subordinate clauses may seem a problem, but in practical usage the depth of nesting is severely limited. In English, a sentence containing a subclause containing a subclause containing a subclause will baffle the reader, and even in German and Dutch nestings over, say, five deep are frowned upon. We replicate the grammar the desired number of times and remove the possibility of further recursion from the deepest level. Then the deepest level is regular, which makes the other levels regular in turn. The resulting grammar will be huge but regular, and will be able to profit from all the simple and efficient techniques known for regular grammars. The required duplications and modifications are mechanical and can be done by a program. Dewar, Bratley and Thorne [NatLang 1969] describe an early example of this approach, Blank [NatLang 1989] a recent one.

5.1.3 Pattern searching

Many linear patterns, especially text patterns, have a structure that is easily expressed by a (quasi-)regular grammar. Notations that indicate amounts of money in various currencies, for instance, have the structure given by the grammar of Figure 5.4, where ␣ has been used to indicate a space symbol. Examples are $␣19.95 and ¥␣1600. Such notations, however, do not occur in isolation but are usually embedded in long stretches of text that itself does not conform to the grammar of Figure 5.4. To isolate the notations, a recognizer (rather than a parser) is derived from the grammar that will accept arbitrary text and will indicate where sequences of symbols are found that conform to

Amount         -> CurrencySymbol Space* Digit+ Cents?
CurrencySymbol -> ƒ | $ | ¥ | £ | ...
Space          -> ␣
Digit          -> [0123456789]
Cents          -> . Digit Digit | .--

Figure 5.4 A quasi-regular grammar for currency notations

the grammar. Parsing (or another form of analysis) is deferred to a later stage. A technique for constructing such a recognizer is given in Section 5.3.4.
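For this kind of pattern searching, a modern regular-expression engine can play the role of that recognizer; a minimal sketch in Python (the transcription of Figure 5.4 into a regular expression is our own):

import re

# Figure 5.4 as a regular expression:
# CurrencySymbol Space* Digit+ Cents?, with Cents = .DD or .--
amount = re.compile(r'[ƒ$¥£]\s*[0-9]+(?:\.[0-9][0-9]|\.--)?')

text = 'He paid $ 19.95 for it; I paid ¥ 1600.'
print(amount.findall(text))     # ['$ 19.95', '¥ 1600']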

5.2 PRODUCING FROM A REGULAR GRAMMAR

When producing from a regular grammar, the producer needs to remember only one thing: which non-terminal is next. We shall illustrate this and further concepts using the simple regular grammar of Figure 5.5.

S -> a A
S -> a B
A -> b B
A -> b C
B -> c A
B -> c C
C -> a

Figure 5.5 Sample regular grammar

This grammar produces sentences consisting of an a followed by an alternating sequence of b’s and c’s followed by a terminating a. For the moment we shall restrict ourselves to regular grammars in standard notation; further on we shall extend our methods to more convenient forms.

The one non-terminal the producer remembers is called its state and the producer is said to be in that state. When a producer is in a given state, for instance, A, it chooses one of the rules belonging to that state, for instance, A->bC, produces the b and moves to state C. Such a move is called a state transition. It is customary to represent the states and the possible transitions of a producer in a transition diagram, Figure 5.6, where the above state transition is represented by the arc marked b from A to C.

 

[Figure 5.6 is not reproduced here. It shows five circles S, A, B, C and à, with arcs marked a from S to A and from S to B, arcs marked b from A to B and from A to C, arcs marked c from B to A and from B to C, and an arc marked a from C to the accepting state à.]

Figure 5.6 Transition diagram for the regular grammar of Figure 5.5


S is the initial state and the accepting state is marked à; another convention (not used here) is to draw an accepting state as a double circle. The symbols on the arcs are those produced by the corresponding move. The producer stops when it is in an accepting state. Like the non-deterministic automaton we saw in Section 3.4, the producer is an automaton, a finite non-deterministic automaton, or finite-state automaton, to be exact. “Finite” because it can only be in a finite number of states (5 in this case; 3 bits of internal memory would suffice) and “non-deterministic” because, for instance, in state S it has more than one way to produce an a.
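A producer for this automaton takes only a few lines of code; a sketch, with the transition-table encoding our own assumption and the accepting state à written as 'accept':

import random

transitions = {                 # the automaton of Figure 5.6
    'S': [('a', 'A'), ('a', 'B')],
    'A': [('b', 'B'), ('b', 'C')],
    'B': [('c', 'A'), ('c', 'C')],
    'C': [('a', 'accept')],
}

def produce():
    state, out = 'S', []
    while state != 'accept':    # the state is all the producer remembers
        symbol, state = random.choice(transitions[state])
        out.append(symbol)
    return ''.join(out)

print(produce())                # e.g. 'aba', 'abca', 'abcbca', ...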

5.3 PARSING WITH A REGULAR GRAMMAR

The above automaton for producing a sentence can in principle also be used for parsing. If we have a sentence, for instance, abcba, and want to check and parse it, we can view the above transition diagram as a maze and the (tokens in the) sentence as a guide. If we manage to follow a path through the maze, matching symbols from our sentence to those on the walls of the corridors as we go and end up in à exactly at the end of the sentence, we have checked the sentence and the names of the rooms we have visited form the backbone of the parse tree. See Figure 5.7, where the path is shown as a dotted line.

 

 

 

 

 

[Figure 5.7 is not reproduced here. It shows the transition diagram of Figure 5.6 drawn as a maze, with the passage for abcba as a dotted line, and the same passage linearized:

S -a-> A -b-> B -c-> A -b-> C -a-> à
]

Figure 5.7 Actual and linearized passage through the maze

Now this is easier said than done. How did we know, for instance, to turn left in room S rather than right? Of course we could employ general maze-solving techniques (and they would give us our answer in exponential time) but a much simpler and much more efficient answer is available here: we split ourselves in two and head both ways. After the first a of abcba we are in the set of rooms {A, B}. Now we have a b to follow; from B there are no exits marked b but from A there are two, which lead to B and C. So we are now in rooms {B, C}. Our path is now more difficult to depict but still easy to linearize, as shown in Figure 5.8. We can find the parsing by starting at the end and following the pointers backwards: à <- C <- A <- B <- A <- S. If the grammar is ambiguous the backward pointers may bring us to a fork in the road: an ambiguity has been found and both paths have to be followed separately to find both parsings. With regular grammars, however, one is often not interested in the parse, but only in the recognition: the fact that the input is correct and it ends here suffices.
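This “split ourselves in two” idea is exactly the state-set simulation of a non-deterministic automaton; a sketch, reusing the transitions table from the producer sketch above:

def recognize(sentence):
    states = {'S'}                        # start in S
    for symbol in sentence:               # follow all matching exits at once
        states = {nxt for st in states
                      for sym, nxt in transitions.get(st, [])
                      if sym == symbol}
    return 'accept' in states             # did some path end in à?

print(recognize('abcba'))    # True;  state sets {S} {A,B} {B,C} {A,C} {B,C} {accept}
print(recognize('abcb'))     # False: we end in {B,C}, not in à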
