Sec. 4.2] The CYK parsing method 101
most one results in a left-most derivation; consistently taking the right-most one results in a right-most derivation.
Notice that we can use an Unger-style parser for this. However, it would not have to generate all partitions any more, because we already know which partitions will work.
Let us try to find a left-most derivation for the example sentence and grammar, using the recognition table of Figure 4.14. We begin with the start symbol, Number. Our sentence contains seven symbols, which is certainly more than one, so we have to use one of the rules with a right-hand side of the form AB. The Integer Digit rule is not applicable here, because the only instance of Digit that could lead to a derivation of the sentence is the one in Rs7,1, but Integer is not a member of Rs1,6. The Integer Fraction rule is not applicable either, because there is no Fraction deriving the last part of the sentence. This leaves us with the production rule Number -> N1 Scale’, which is indeed applicable, because N1 is a member of Rs1,4 and Scale’ is a member of Rs5,3, so N1 derives 32.5 and Scale’ derives e+1.
Next, we have to find out how N1 derives 32.5. There is only one applicable rule: N1 -> Integer Fraction, and it is indeed applicable, because Integer is a member of Rs1,2 and Fraction is a member of Rs3,2, so Integer derives 32 and Fraction derives .5. In the end, we find the following derivation:
Number ->
N1 Scale’ ->
Integer Fraction Scale’ ->
Integer Digit Fraction Scale’ ->
3 Digit Fraction Scale’ ->
3 2 Fraction Scale’ ->
3 2 T1 Integer Scale’ ->
3 2 . Integer Scale’ ->
3 2 . 5 Scale’ ->
3 2 . 5 N2 Integer ->
3 2 . 5 T2 Sign Integer ->
3 2 . 5 e Sign Integer ->
3 2 . 5 e + Integer ->
3 2 . 5 e + 1
Unfortunately, this is not exactly what we want, because this is a derivation that uses the rules of the grammar of Figure 4.13, not the rules of the grammar of Figure 4.4, the one that we started with.
4.2.6 Undoing the effect of the CNF transformation
When we examine the grammar of Figure 4.4 and the recognition table of Figure 4.14, we see that the recognition table contains the information we need on most of the non-terminals of the original grammar. However, there are a few non-terminals missing in the recognition table: Scale, Real, and Empty. Scale and Empty were removed because they became non-reachable after the elimination of ε-rules. Empty was removed altogether, because it only derived the empty string, and Scale was replaced by Scale’, where Scale’ derives exactly the same as Scale, except for the empty
string. We can use this to add some more information to the recognition table: at every occurrence of Scale’, we add Scale.
102 General non-directional methods [Ch. 4
The non-terminal Real was removed because it became non-reachable after eliminating the unit rules. Now, the CYK algorithm does not require that all non-terminals in the grammar be reachable. We could just as well have left the non-terminal Real in the grammar, and have transformed its rules to CNF. The CYK algorithm would then have added Real to the recognition table, wherever that would be appropriate. The rules for Real that would be added to the grammar of Figure 4.13 are:
Real -> N1 Scale’ | Integer Fraction
The resulting recognition table is presented in Figure 4.15. In this figure, we have also added an extra row at the bottom of the triangle. This extra row represents the non-terminals that derive the empty string. These non-terminals can be considered as possibly occurring between any two adjacent symbols in the sentence, and also in front of or at the end of the sentence. The set Rsi,0 represents the non-terminals that can be considered as occurring just in front of symbol zi, and the set Rsn+1,0 represents the ones that can occur at the end of the sentence.
[Figure 4.15: the triangular recognition table of Figure 4.14, with Real added next to Number where appropriate, Scale added at every occurrence of Scale’, and an extra bottom row l = 0 containing Scale and Empty at every position i = 1 to 8, over the input 3 2 . 5 e + 1.]
Figure 4.15 The recognition table with Scale, Real, and Empty added
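The table repairs just described can be sketched as a small post-processing step. This is our own illustration, not code from the book; the table is represented as a dictionary R mapping (i, l) to the set Rsi,l, and the names of the helper and its parameters are ours.

```python
# A sketch (ours, not the book's) of undoing the CNF transformation in
# the recognition table: add Scale wherever Scale' occurs, and put the
# non-terminals that derive the empty string into an extra l = 0 row
# R[(i, 0)] for every position i, including one past the end.

def undo_cnf(R, n, alias={"Scale'": "Scale"}, nullable={"Scale", "Empty"}):
    for cell in R.values():                 # Scale' also stands for Scale
        for prime, orig in alias.items():
            if prime in cell:
                cell.add(orig)
    for i in range(1, n + 2):               # empty string fits between any two symbols
        R[(i, 0)] = set(nullable)
    return R

# Tiny demonstration on a hand-made fragment of the table of Figure 4.14:
R = {(5, 3): {"Scale'"}}
undo_cnf(R, 7)                              # the sentence 32.5e+1 has length 7
print(R[(5, 3)])                            # now also contains Scale
print(R[(8, 0)])                            # Scale and Empty occur at the end
```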
Now, we have a recognition table which contains all the information we need to parse a
sentence with the original grammar. Again, a derivation starts with the start-symbol S.
If A1 A2 ... Am is a right-hand side of S, we want to know if this rule can be applied, that is, if A1 A2 ... Am derives s1,n. This is checked, starting with A1. There are two cases:
- A1 is a terminal symbol. In this case, it must be the first symbol of s1,n, or this rule is not applicable. Then, we must check if A2 ... Am derives s2,n-1, in the same way that we are now checking if A1 A2 ... Am derives s1,n.
- A1 is a non-terminal. In this case, it must be a member of an Rs1,k, for some k, or this rule is not applicable. Then, we must check if A2 ... Am derives sk+1,n-k, in the same way that we are now checking if A1 A2 ... Am derives s1,n. If we want all parsings, we must do this for every k for which A1 is a member of Rs1,k. Notice that non-terminals deriving the empty string pose no problem at all, because they appear as a member of Rsi,0 for all i.
We have now determined whether the rule is applicable, and if it is, which parts of the rule derive which substrings. The next step now is to determine how the substrings can be derived. These tasks are similar to the task we started with, and are solved in the same way. This process will terminate at some time, provided the grammar does not contain loops. This is simply an Unger parser that knows in advance which partitions will lead to a successful parse.
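Such a table-driven Unger parser can be sketched as follows. This is a hypothetical illustration on a toy CNF grammar, not the number grammar of the text: the recognition table is filled first, and the reconstruction then tries only those partitions the table has already approved.

```python
# An Unger-style parser that knows which partitions work: the CYK table
# R[(i, l)] holds the non-terminals deriving s[i:i+l] (0-based i), and
# derive() consults it instead of trying every partition blindly.

RULES = [                     # toy CNF grammar: rhs of length 2, or one terminal
    ("S", ("A", "B")),
    ("A", ("a",)),
    ("B", ("b",)),
]

def cyk_table(s):
    n = len(s)
    R = {(i, 1): {lhs for lhs, rhs in RULES if rhs == (s[i],)} for i in range(n)}
    for l in range(2, n + 1):
        for i in range(n - l + 1):
            R[(i, l)] = set()
            for k in range(1, l):
                for lhs, rhs in RULES:
                    if (len(rhs) == 2 and rhs[0] in R[(i, k)]
                            and rhs[1] in R[(i + k, l - k)]):
                        R[(i, l)].add(lhs)
    return R

def derive(sym, i, l, s, R):
    """Return one parse tree (sym, children), guided by the table."""
    for lhs, rhs in RULES:
        if lhs != sym:
            continue
        if len(rhs) == 1 and l == 1 and s[i] == rhs[0]:
            return (sym, [rhs[0]])
        if len(rhs) == 2:
            for k in range(1, l):         # only partitions the table approves
                if rhs[0] in R[(i, k)] and rhs[1] in R[(i + k, l - k)]:
                    return (sym, [derive(rhs[0], i, k, s, R),
                                  derive(rhs[1], i + k, l - k, s, R)])
    raise ValueError("no derivation")

s = "ab"
R = cyk_table(s)
print("S" in R[(0, 2)])                   # recognition: True
print(derive("S", 0, 2, s, R))            # one parse tree
```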
Let us go back to the grammar of Figure 4.4 and the recognition table of Figure 4.15, and see how this works for our example input sentence. We now know that Number does derive 32.5e+1, and want to know how. We first ask ourselves: can we use the Number -> Integer rule? Integer is a member of Rs1,1 and Rs1,2, but there is nothing behind the Integer in the rule to derive the rest of the sentence, so we cannot use this rule. Can we use the Number -> Real rule? Yes we can, because Real is a member of Rs1,7, and the length of the sentence is 7. So, we start our derivation with
Number -> Real -> ...
Now, we get similar questions for the Real non-terminal: can we use the Real -> Integer Fraction Scale rule? Well, Integer is a member of Rs1,1, but we cannot find a Fraction in any of the Rs2,k sets. However, Integer is also a member of Rs1,2, and Fraction is a member of Rs3,2. Now, Scale is a member of Rs5,0; this does not help, because it would leave nothing in the rule to derive the rest. Fortunately, Scale is also a member of Rs5,3, and that matches exactly to the end of the string. So, this rule is indeed applicable, and we continue our derivation:
Number -> Real -> Integer Fraction Scale -> ...
The sentence is now split up into three parts:
              Number
               Real
   Integer   Fraction   Scale
    3 2        . 5      e + 1
It is left to the reader to verify that we will find only one derivation, and that this is it:
Number ->
Real ->
Integer Fraction Scale ->
Integer Digit Fraction Scale ->
Digit Digit Fraction Scale ->
3 Digit Fraction Scale ->
3 2 Fraction Scale ->
3 2 . Integer Scale ->
3 2 . Digit Scale ->
3 2 . 5 Scale ->
3 2 . 5 e Sign Integer ->
3 2 . 5 e + Integer ->
3 2 . 5 e + Digit ->
3 2 . 5 e + 1
4.2.7 A short retrospective of CYK
We have come a long way. We started with building a recognition table using the original grammar. Then we found that using the original grammar with its unit rules and ε-rules is somewhat complicated, although it can certainly be done. We proceeded by transforming the grammar to CNF. CNF does not contain unit rules or ε-rules; our gain in this respect was that the algorithm for constructing the recognition table became much simpler. The limitation of the maximum length of a right-hand side to 2 was a gain in efficiency, and also a little in simplicity. However, Sheil [CF 1976] has demonstrated that the efficiency only depends on the maximum number of non-terminals occurring in a right-hand side of the grammar, not on the length of the right-hand sides. This can easily be understood once one realizes that the efficiency depends (among other things) on the number of cuts in a substring that are “difficult” to find when checking whether a right-hand side derives this substring. This number of “difficult” cuts depends only on the number of non-terminals in the right-hand side. So, for efficiency, CNF is a bit too restrictive.
A disadvantage of this transformation to CNF is that the resulting recognition table lacks some information that we need to construct a derivation using the original grammar. In the transformation process, some non-terminals were thrown away because they became non-reachable or non-productive. Fortunately, the missing information could easily be added. Ultimately, this process resulted in almost the same recognition table that we would get with our first attempt using the original grammar. It only contains some extra information on non-terminals that were added during the transformation of the grammar to CNF. More importantly, however, it was obtained in a simpler and much more efficient way.
4.2.8 Chart parsing
The CYK algorithm is also known under the name of chart parsing. More precisely, both techniques have a number of variants and some variants of the CYK algorithm are identical to some variants of chart parsing. The most striking difference between them lies in the implementation; conceptually both algorithms do the same thing: they collect possible parsings for larger and larger chunks of the input.
Although often presented in a different format, a chart is just a recognition table. Figure 4.16 shows the recognition table of Figure 4.14 in a chart format: each arc represents a non-terminal deriving the part of the sentence spanned by the arc.
[Figure 4.16: the chart for 3 2 . 5 e + 1. Arcs over single symbols carry Number, Integer, Digit, T1, T2 and Sign; longer arcs carry Integer, Fraction, N2, N1, Scale’ and Number, with the widest Number arc spanning the whole sentence.]
Figure 4.16 The recognition table of Figure 4.14 in chart format
Several variants of chart parsing are discussed and compared in Bolc [NatLang 1987].
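To make the correspondence concrete, here is a sketch (ours, on a toy CNF grammar) that builds the chart directly as a set of arcs (i, j, A), each meaning that non-terminal A derives the input from position i up to (not including) j, collecting arcs for larger and larger chunks exactly as CYK table-filling does:

```python
# Chart construction as arc collection: an arc (i, j, A) corresponds to
# the table entry "A is in Rs i, j-i".

def chart(s, rules):
    # Arcs of width 1 come from the terminal rules A -> a.
    arcs = {(i, i + 1, lhs) for i in range(len(s))
            for lhs, rhs in rules if rhs == (s[i],)}
    # Wider arcs combine two adjacent narrower arcs via a rule A -> B C.
    for width in range(2, len(s) + 1):
        for i in range(len(s) - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for lhs, rhs in rules:
                    if (len(rhs) == 2 and (i, k, rhs[0]) in arcs
                            and (k, j, rhs[1]) in arcs):
                        arcs.add((i, j, lhs))
    return arcs

RULES = [("S", ("A", "B")), ("A", ("a",)), ("B", ("b",)), ("S", ("a",))]
print(sorted(chart("ab", RULES)))
```

The input "ab" yields the arcs (0,1,A), (0,1,S), (1,2,B) and, spanning the whole input, (0,2,S).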
5
Regular grammars and finite-state automata
Regular grammars are the simplest form of grammars that still have generative power. They can describe concatenation (joining two texts together) and repetition and can specify alternatives, but they cannot express nesting. Regular grammars are probably the best-understood part of formal linguistics and almost all questions about them can be answered.
5.1 APPLICATIONS OF REGULAR GRAMMARS
In spite of their simplicity there are many applications of regular grammars, of which we will briefly mention the most important ones.
5.1.1 CF parsing
In some parsers for CF grammars, a subparser can be discerned that handles a regular grammar; such a subparser is based implicitly or explicitly on the following surprising phenomenon. Consider the sentential forms in left-most or right-most derivations. Such a sentential form consists of a closed (finished) part, which contains terminal symbols only, and an open (unfinished) part, which contains non-terminals as well. In left-most derivations, the open part starts at the left-most non-terminal and extends to the right; in right-most derivations, the open part starts at the right-most non-terminal and extends to the left. See Figure 5.1, which uses sentential forms from Section 2.5.2.
[Figure 5.1: the sentential forms d , N & N (left-most) and N , N & h (right-most), with the open part marked in each: it extends to the right of d in the first, and to the left of h in the second.]
Figure 5.1 Open parts in left-most and right-most productions
Now it can be proved (and it is not difficult to show) that these open parts can be described by a regular grammar (which follows from the CF grammar). Furthermore, these open parts of the sentential form play an important role in some CF parsing methods which explains the significance of regular grammars for CF parsing.
5.1.2 Systems with finite memory
Since CF (or stronger) grammars allow nesting and since nesting can, in principle, be arbitrarily deep, the generation of correct CF (or stronger) sentences may, in principle, require an arbitrary amount of memory to temporarily hold the unprocessed nesting information. Mechanical systems do not possess an arbitrary amount of memory and consequently cannot exhibit CF behaviour and are restricted to regular behaviour. This is immediately clear for simple mechanical systems like vending machines, traffic lights and video-recorders: they all behave according to a regular grammar. It is also in principle true for more complicated mechanical systems, like a country’s train system or a computer. Here, the argument gets, however, rather vacuous since nesting information can be represented very efficiently and a little memory can take care of a lot of nesting. Consequently, although these systems in principle exhibit regular behaviour, it is often easier to describe them with CF or stronger means, even though that would incorrectly ascribe infinite memory to them.
Conversely, the global behaviour of many systems that do have much memory can still be described by a regular grammar, and many CF grammars are already for a large part regular. This is because regular grammars already take adequate care of concatenation, repetition and choice; context-freeness is only required for nesting. If we call a rule that produces a regular (sub)language (and which consequently could be replaced by a regular rule) “quasi-regular”, we can observe the following. If all alternatives of a rule contain terminals only, that rule is quasi-regular (choice). If all alternatives of a rule contain only terminals and non-terminals the rules of which are quasi-regular and non-recursive, then that rule is quasi-regular (concatenation). And if a rule is recursive but recursion occurs only at the end of an alternative and involves only quasi-regular rules, then that rule is again quasi-regular (repetition). This often covers large parts of a CF grammar. See Krzemień and Łukasiewicz [FS 1976] for an algorithm to identify all quasi-regular rules in a grammar.
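The observations above suggest a simple fixed-point test for quasi-regularity. The sketch below is our own simplification, not the cited algorithm: it recognizes choice, concatenation and repetition through direct tail recursion only, so rules that are tail-recursive through another non-terminal are missed.

```python
# Grow the set of quasi-regular non-terminals to a fixed point.  A rule
# is accepted when every symbol of every alternative is a terminal, an
# already-accepted non-terminal, or the rule's own left-hand side
# appearing as the last symbol of an alternative (tail recursion).

def quasi_regular(grammar, terminals):
    ok = set()
    changed = True
    while changed:
        changed = False
        for lhs, alts in grammar.items():
            if lhs in ok:
                continue
            if all(all(sym in terminals or sym in ok or
                       (sym == lhs and pos == len(alt) - 1)
                       for pos, sym in enumerate(alt))
                   for alt in alts):
                ok.add(lhs)
                changed = True
    return ok

# Repetition through tail recursion, in the style of Adjective* above:
G = {
    "AdjNoun":   [["Noun"], ["Adjective", "AdjNoun"]],
    "Adjective": [["adj"]],
    "Noun":      [["noun"]],
}
print(quasi_regular(G, {"adj", "noun"}))    # all three rules qualify

# Centre recursion (real nesting) is correctly rejected:
G2 = {"Nest": [["a", "Nest", "b"], ["a", "b"]]}
print(quasi_regular(G2, {"a", "b"}))        # nothing qualifies
```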
Natural languages are a case in point. Although CF or stronger grammars seem necessary to delineate the set of correct sentences (and they may very well be, to catch many subtleties), quite a good rough description can be obtained through regular languages. Consider the stylized grammar for the main clause in a Subject-Verb-Object (SVO) language in Figure 5.2.
MainClause -> Subject Verb Object
Subject -> [ a | the ] Adjective* Noun
Object -> [ a | the ] Adjective* Noun
Verb -> verb1 | verb2 | ...
Adjective -> adj1 | adj2 | ...
Noun -> noun1 | noun2 | ...
Figure 5.2 A not obviously quasi-regular grammar
This grammar is quasi-regular: Verb, Adjective and Noun are regular by themselves, Subject and Object are concatenations of repetitions of regular forms (regular non-terminals and choices) and are therefore quasi-regular, and so is MainClause. It takes some work to bring this grammar into standard regular form, but it can be done, as shown in Figure 5.3, in which the lists for verbs, adjectives and nouns have been abbreviated to verb, adjective and noun, to save space.
MainClause -> a SubjAdjNoun_verb_Object
MainClause -> the SubjAdjNoun_verb_Object
SubjAdjNoun_verb_Object -> noun verb_Object
SubjAdjNoun_verb_Object -> adjective SubjAdjNoun_verb_Object
verb_Object -> verb Object
Object -> a ObjAdjNoun
Object -> the ObjAdjNoun
ObjAdjNoun -> noun
ObjAdjNoun -> adjective ObjAdjNoun
verb -> verb1 | verb2 | ...
adjective -> adj1 | adj2 | ...
noun -> noun1 | noun2 | ...
Figure 5.3 A regular grammar in standard form for that of Figure 5.2
Even (finite) context-dependency can be incorporated: for languages that require the verb to agree in number with the subject, we duplicate the first rule:
MainClause -> SubjectSingular VerbSingular Object
            | SubjectPlural VerbPlural Object
and duplicate the rest of the grammar accordingly. The result is still regular. Nested subordinate clauses may seem a problem, but in practical usage the depth of nesting is severely limited. In English, a sentence containing a subclause containing a subclause containing a subclause will baffle the reader, and even in German and Dutch nestings over, say, five deep are frowned upon. We replicate the grammar the desired number of times and remove the possibility of further recursion from the deepest level. Then the deepest level is regular, which makes the other levels regular in turn. The resulting grammar will be huge but regular and will be able to profit from all simple and efficient techniques known for regular grammars. The required duplications and modifications are mechanical and can be done by a program. Dewar, Bratley and Thorne [NatLang 1969] describe an early example of this approach, Blank [NatLang 1989] a recent one.
5.1.3 Pattern searching
Many linear patterns, especially text patterns, have a structure that is easily expressed by a (quasi-)regular grammar. Notations that indicate amounts of money in various currencies, for instance, have the structure given by the grammar of Figure 5.4, where ␣ has been used to indicate a space symbol. Examples are $ 19.95 and ¥ 1600. Such notations, however, do not occur in isolation but are usually embedded in long stretches of text that itself does not conform to the grammar of Figure 5.4. To isolate the notations, a recognizer (rather than a parser) is derived from the grammar that will accept arbitrary text and will indicate where sequences of symbols are found that conform to the grammar. Parsing (or another form of analysis) is deferred to a later stage. A technique for constructing such a recognizer is given in Section 5.3.4.
Amount -> CurrencySymbol Space* Digit+ Cents?
CurrencySymbol -> ƒ | $ | ¥ | £ | ...
Space -> ␣
Digit -> [0123456789]
Cents -> . Digit Digit | .--
Figure 5.4 A quasi-regular grammar for currency notations
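With a modern regular-expression engine, such a recognizer is a single pattern. The sketch below transcribes part of the grammar of Figure 5.4 (the currency-symbol list is shortened, and the pattern is ours, not the book's):

```python
import re

# Amount -> CurrencySymbol Space* Digit+ Cents?, with Cents either two
# digits or the ".--" form; only $, yen and pound are included here.
AMOUNT = re.compile(r"[$¥£]\s*\d+(\.(\d\d|--))?")

text = "the book costs $ 19.95 here and ¥ 1600 in Tokyo"
for m in AMOUNT.finditer(text):      # scan arbitrary text, report matches
    print(m.start(), m.group())
```

Note that the recognizer merely locates the notations in running text; analysing a match (for instance, splitting it into currency symbol, units and cents) is the deferred parsing step.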
5.2 PRODUCING FROM A REGULAR GRAMMAR
When producing from a regular grammar, the producer needs to remember only one thing: which non-terminal is next. We shall illustrate this and further concepts using the simple regular grammar of Figure 5.5.
S -> a A
S -> a B
A -> b B
A -> b C
B -> c A
B -> c C
C -> a
Figure 5.5 Sample regular grammar
This grammar produces sentences consisting of an a followed by an alternating sequence of b’s and c’s followed by a terminating a. For the moment we shall restrict ourselves to regular grammars in standard notation; further on we shall extend our methods to more convenient forms.
The one non-terminal the producer remembers is called its state and the producer is said to be in that state. When a producer is in a given state, for instance, A, it chooses one of the rules belonging to that state, for instance, A->bC, produces the b and moves to state C. Such a move is called a state transition. It is customary to represent the states and the possible transitions of a producer in a transition diagram, Figure 5.6, where the above state transition is represented by the arc marked b from A to C.
[Figure 5.6: a diagram with states S, A, B, C and an accepting state marked à; arcs labelled a lead from S to A and to B, arcs labelled b from A to B and to C, arcs labelled c from B to A and to C, and an arc labelled a from C to the accepting state.]
Figure 5.6 Transition diagram for the regular grammar of Figure 5.5
S is the initial state and the accepting state is marked à; another convention (not used here) is to draw an accepting state as a double circle. The symbols on the arcs are those produced by the corresponding move. The producer stops when it is in an accepting state. Like the non-deterministic automaton we saw in Section 3.4, the producer is an automaton, a finite non-deterministic automaton, or finite-state automaton, to be exact. “Finite” because it can only be in a finite number of states (5 in this case; 3 bits of internal memory would suffice) and “non-deterministic” because, for instance, in state S it has more than one way to produce an a.
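A minimal simulation of such a producer can be sketched as follows. The code is ours, not the book's; the None marker standing for the accepting state is an invented convention.

```python
import random

# Transitions of the grammar of Figure 5.5: state -> list of
# (symbol produced, next state); None means "stop, accepting".
RULES = {
    "S": [("a", "A"), ("a", "B")],
    "A": [("b", "B"), ("b", "C")],
    "B": [("c", "A"), ("c", "C")],
    "C": [("a", None)],          # C -> a: produce a and stop
}

def produce(rng=random.Random(0)):        # fixed seed for reproducibility
    state, out = "S", []
    while state is not None:              # the one thing remembered: the state
        symbol, state = rng.choice(RULES[state])
        out.append(symbol)
    return "".join(out)

print(produce())
```

Every sentence produced is an a, then a strictly alternating sequence of b's and c's, then a terminating a, exactly as described above.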
5.3 PARSING WITH A REGULAR GRAMMAR
The above automaton for producing a sentence can in principle also be used for parsing. If we have a sentence, for instance, abcba, and want to check and parse it, we can view the above transition diagram as a maze and the (tokens in the) sentence as a guide. If we manage to follow a path through the maze, matching symbols from our sentence to those on the walls of the corridors as we go and end up in à exactly at the end of the sentence, we have checked the sentence and the names of the rooms we have visited form the backbone of the parse tree. See Figure 5.7, where the path is shown as a dotted line.
[Figure 5.7: the transition diagram of Figure 5.6 drawn as a maze, with the path for abcba shown as a dotted line; linearized, the passage reads:]

S   A   B   A   C   à
  a   b   c   b   a
Figure 5.7 Actual and linearized passage through the maze
Now this is easier said than done. How did we know, for instance, to turn left in room S rather than right? Of course we could employ general maze-solving techniques (and they would give us our answer in exponential time) but a much simpler and much more efficient answer is available here: we split ourselves in two and head both ways. After the first a of abcba we are in the set of rooms {A, B}. Now we have a b to follow; from B there are no exits marked b but from A there are two, which lead to B and C. So we are now in rooms {B, C}. Our path is now more difficult to depict but still easy to linearize, as shown in Figure 5.8. We can find the parsing by starting at the end and following the pointers backwards: à <- C <- A <- B <- A <- S. If the grammar is ambiguous the backward pointers may bring us to a fork in the road: an ambiguity has been found and both paths have to be followed separately to find both parsings. With regular grammars, however, one is often not interested in the parse, but only in the recognition: the fact that the input is correct and it ends here suffices.
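The state-set technique just described can be sketched as follows. The code is ours (the name END standing for the accepting state à is an invented convention): for each input position it keeps the set of reachable states, each with one back-pointer, and recovers the path by following the pointers backwards.

```python
# Transitions of the automaton of Figure 5.6.
RULES = {
    "S": [("a", "A"), ("a", "B")],
    "A": [("b", "B"), ("b", "C")],
    "B": [("c", "A"), ("c", "C")],
    "C": [("a", "END")],         # END is the accepting state
}

def recognize(sentence):
    """Return the list of states visited, or None if the input is rejected."""
    sets = [{"S": None}]                      # per position: state -> back-pointer
    for token in sentence:
        nxt = {}
        for state in sets[-1]:                # "split ourselves" over all states
            for symbol, to in RULES.get(state, []):
                if symbol == token:
                    nxt.setdefault(to, state) # keep one back-pointer per state
        sets.append(nxt)
    if "END" not in sets[-1]:
        return None
    path, state = [], "END"                   # follow the pointers backwards
    for i in range(len(sets) - 1, -1, -1):
        path.append(state)
        state = sets[i][state]
    return list(reversed(path))

print(recognize("abcba"))    # ['S', 'A', 'B', 'A', 'C', 'END']
print(recognize("abba"))     # None: rejected
```

Keeping only one back-pointer per state is exactly the simplification mentioned above: it yields recognition and one path, and an ambiguous grammar would require keeping all back-pointers per state to recover every parsing.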