Cooper K., Engineering a Compiler

Chapter 2

Lexical Analysis

2.1 Introduction

The scanner takes as input a stream of characters and produces as output a stream of words, along with their associated syntactic categories. It aggregates letters together to form words and applies a set of rules to determine whether or not the word is legal in the source language and, if so, its syntactic category. This task can be done quickly and efficiently using a specialized recognizer.

This chapter describes the mathematical tools and the programming techniques that are commonly used to perform lexical analysis. Most of the work in scanner construction has been automated; indeed, this is a classic example of the application of theoretical results to solve an important and practical problem—specifying and recognizing patterns. The problem has a natural mathematical formulation. The mathematics leads directly to efficient implementation schemes. The compiler writer specifies the lexical structure of the language using a concise notation and the tools transform that specification into an efficient executable program. These techniques have led directly to useful tools in other settings, like the Unix tool grep and the regular-expression pattern matching found in many text editors and word-processing tools. Scanning is, essentially, a solved problem.

Scanners look at a stream of characters and recognize words. The rules that govern the lexical structure of a programming language, sometimes called its micro-syntax, are simple and regular. This leads to highly efficient, specialized recognizers for scanning. Typically, a compiler’s front end has a scanner to handle its micro-syntax and a parser for its context-free syntax, which is more complex to recognize. This setup is shown in Figure 2.1. Separating micro-syntax from syntax simplifies the compiler-writer’s life in three ways.

The description of syntax used in the parser is written in terms of words and syntactic categories, rather than letters, numbers, and blanks. This lets the parser ignore irrelevant issues like absorbing extraneous blanks, newlines, and comments. These are hidden inside the scanner, where they are handled cleanly and efficiently.

[Figure 2.1 shows a block diagram of the front end: source code flows into the scanner, which passes words to the parser, which emits intermediate code.]

Figure 2.1: Structure of a typical front end

Scanner construction is almost completely automated. The lexical rules are encoded in a formal notation and fed to a scanner generator. The result is an executable program that produces the input for the parser. Scanners generated from high-level specifications are quite efficient.

Every rule moved into the scanner shrinks the parser. Parsing is harder than scanning; the amount of code in a parser grows as the grammar grows. Since parser construction requires more direct intervention from the programmer, shrinking the parser reduces the compiler-writer’s effort.

As a final point, well-implemented scanners have lower overhead (measured by instructions executed per input symbol) than well-implemented parsers. Thus, moving work into the scanner improves the performance of the entire front end.

Our goal for this chapter is to develop the notations for specifying lexical patterns and the techniques to convert those patterns directly into an executable scanner. Figure 2.2 depicts this scenario. This technology should allow the compiler writer to specify the lexical properties at a reasonably high level and leave the detailed work of implementation to the scanner generator—without sacrificing efficiency in the final product, the compiler.

First, we will introduce a notation, called regular expressions, that works well for specifying lexical patterns. We will explore the properties of regular expressions and their relationship to a particular kind of recognizer, called a finite automaton. Next, we will develop the techniques and methods that allow us to automate the construction of efficient scanners from regular expressions. We show some additional results that relate regular expressions and automata, and conclude with an example of how complex the task of lexical analysis can be in Fortran, a language designed before we had a good understanding of the mathematics of lexical analysis.

2.2 Specifying Lexical Patterns

Before we can build a scanner, we need a way of specifying the micro-syntax of the source language—of specifying patterns that describe the words in the language. Some parts are quite easy.

Punctuation marks like colons, semi-colons, commas, parentheses, and square brackets can be denoted by their unique character representations:

[Figure 2.2 shows automatic scanner generation: lexical patterns are fed to a scanner generator, which produces the scanner that sits in the front end (source code → scanner → words → parser → intermediate code).]

Figure 2.2: Automatic scanner generation

: ; , ( ) [ ]

Keywords, like if, then, and integer, are equally simple. These words have a unique spelling, so we can represent them as literal patterns—we simply write them down.

Some simple concepts have more complicated representations. For example, the concept of a blank might require a small grammar.

WhiteSpace → WhiteSpace blank
           | WhiteSpace tab
           | blank
           | tab

where blank and tab have the obvious meanings.

Thus, for many words in the source language, we already have a concise representation—we can simply write down the words themselves. Other parts of a programming language are much harder to specify. For these, we will write down rules that can generate all of the appropriate strings.

Consider, for example, a pattern to specify when a string of characters forms a number. An integer might be described as a string of one or more digits, beginning with a digit other than zero, or as the single digit zero. A decimal number is simply an integer, followed by a decimal point, followed by a string of zero or more digits. (Notice that the part to the left of the decimal point cannot have leading zeros unless it is a zero, while the fractional part must be able to have leading zeros.) Real numbers and complex numbers are even more complicated. Introducing optional signs (+ or -) adds yet more clauses to the rules.
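The integer and decimal descriptions above translate naturally into patterns. A minimal sketch using Python's re module (the pattern names and the exact regex syntax are my own transcription, not from the text):

```python
import re

# An integer is "0" or a nonzero digit followed by any digits;
# a decimal is an integer, a point, then zero or more digits.
# An optional sign adds one more clause to each rule.
INTEGER = re.compile(r"[+-]?(0|[1-9][0-9]*)$")
DECIMAL = re.compile(r"[+-]?(0|[1-9][0-9]*)\.[0-9]*$")

assert INTEGER.match("0")
assert INTEGER.match("-42")
assert not INTEGER.match("007")    # leading zeros rejected
assert DECIMAL.match("3.14")
assert DECIMAL.match("10.00")      # the fractional part may have leading zeros
assert not DECIMAL.match("01.5")
```

Note how the leading-zero restriction falls out of the alternation `0|[1-9][0-9]*`, exactly as in the prose description.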

The rules for an identifier are usually somewhat simpler than those for a number. For example, Algol allowed identifier names that consisted of a single alphabetic character, followed by zero or more alphanumeric characters. Many languages include special characters such as the ampersand (&), percent sign (%), and underscore (_) in the alphabet for identifier names.


The complexity of lexical analysis arises from the simple fact that several of the syntactic categories used in a typical programming language contain an effectively infinite set of words. In particular, both numbers and identifiers usually contain sets large enough to make enumeration impractical.1 To simplify scanner construction, we need to introduce a powerful notation to specify these lexical rules: regular expressions.

A regular expression describes a set of strings over the characters contained in some alphabet, Σ, augmented with a character, ε, that represents the empty string. We call the set of strings a language. For a given regular expression, r, we denote the language that it specifies as L(r). A regular expression is built up from three simple operations:

Union – the union of two sets R and S, denoted R ∪ S, is the set {s | s ∈ R or s ∈ S}.

Concatenation – the concatenation of two sets R and S, denoted RS, is the set {st | s ∈ R and t ∈ S}. We will sometimes write R² for RR, the concatenation of R with itself, and R³ for RRR (or RR²).

Closure – the Kleene closure of a set R, denoted R∗, is the set R⁰ ∪ R¹ ∪ R² ∪ · · ·. This is just the concatenation of R with itself, zero or more times.

It is sometimes convenient to talk about the positive closure of R, denoted R⁺. It is defined as R¹ ∪ R² ∪ R³ ∪ · · ·, or RR∗.

Using these three operations, we can define the set of regular expressions (res) over an alphabet Σ.

1. if a ∈ Σ, then a is also a re denoting the set containing only a.

2. if r and s are res, denoting sets L(r) and L(s) respectively, then

(r) is a re denoting L(r)

r | s is a re denoting the union of L(r) and L(s)

rs is a re denoting the concatenation of L(r) and L(s)

r∗ is a re denoting the Kleene closure of L(r).

3. ε is a re denoting the set containing only the empty string.

To disambiguate these expressions, we assume that closure has highest precedence, followed by concatenation, followed by union. (Union is sometimes called alternation.)

As a simple example, consider our clumsy English description of the microsyntax of Algol identifiers. As a re, an identifier might be defined as:

1This is not always the case. Dartmouth Basic, an interpreted language from the early 1960’s, only allowed variable names that began with an alphabetic character and had at most one digit following that character. This limited the programmer to two hundred and eighty six variable names. Some implementations simplified the translation by statically mapping each name to one of two hundred and eighty six memory locations.


alpha → (a | b | c | d | e | f | g | h | i | j | k | l | m
         | n | o | p | q | r | s | t | u | v | w | x | y | z)

digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)

identifier → alpha (alpha | digit)∗

Here, we introduced names for the subexpressions alpha and digit. The entire expression could have been written on one line (with a suitably small font). Where the meaning is clear, we will elide some of the enumerated elements in a definition. Thus, we might write alpha → (a | b | c | · · · | z) as a shorthand for the definition of alpha given previously. This allows us to write res more concisely. For example, the Algol-identifier specification now becomes

(a | b | c | · · · | z) ( (a | b | c | · · · | z) | (0 | 1 | 2 | · · · | 9) )∗
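As a sanity check, the Algol-identifier re translates almost directly into a Python regular expression; a sketch assuming lower-case ASCII letters only:

```python
import re

# Character classes standing in for the enumerated alternations above.
ALPHA = "[a-z]"
DIGIT = "[0-9]"
# alpha (alpha | digit)*  — a letter followed by letters or digits.
IDENTIFIER = re.compile(f"{ALPHA}({ALPHA}|{DIGIT})*$")

assert IDENTIFIER.match("x")
assert IDENTIFIER.match("sum2")
assert not IDENTIFIER.match("2sum")   # must begin with a letter
```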

A similar set of rules can be built up to describe the various kinds of numbers.

integer → (+ | − | ε) (0 | 1 | 2 | · · · | 9)⁺

decimal → integer . (0 | 1 | 2 | · · · | 9)∗

real → (integer | decimal) E integer

In the re real, the letter E is a delimiter that separates the mantissa from the exponent. (Some programming languages use other letters to denote specific internal formats for floating point numbers.)

Notice that the specification for an integer admits an arbitrary number of leading zeros. We can refine the specification to avoid leading zeros, except for the single, standalone zero required to write the number zero.

integer → (+ | − | ε) (0 | (1 | 2 | 3 | · · · | 9) (0 | 1 | 2 | · · · | 9)∗)

Unfortunately, the rule for real relied on leading zeros in its exponent, so we must also rewrite that rule as

real → (integer | decimal) E 0∗ integer, or

real → (integer | decimal) E (0 | 1 | 2 | · · · | 9)⁺

Of course, even more complex examples can be built using the three operations of regular expressions—union, concatenation, and closure.
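One way to transcribe the refined rules is again with Python's re module; this is a hedged sketch (the names INT, DEC, and REAL are mine), using the variant of real whose exponent is an integer without leading zeros:

```python
import re

# Refined integer: optional sign, then "0" or a nonzero-led digit string.
INT = r"[+-]?(0|[1-9][0-9]*)"
# Decimal: an integer, a point, zero or more digits.
DEC = INT + r"\.[0-9]*"
# Real: (integer | decimal) E integer.
REAL = re.compile(f"({INT}|{DEC})E{INT}$")

assert REAL.match("1.5E-3")
assert REAL.match("2E10")
assert not REAL.match("2E007")   # leading zeros in the exponent rejected
```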

2.3 Closure Properties of REs

The languages generated by regular expressions have been studied extensively. They have many important properties; some of those properties play an important role in the use of regular expressions to build scanners.

Regular expressions are closed under many operations. For example, given two regular expressions r and s over an alphabet Σ, we know that (r | s) is a regular expression that represents the language


{w | w ∈ L(r) or w ∈ L(s)}.

This follows from the definition of regular expressions. Because (r | s) is, by definition, a regular expression, we say that the set of regular expressions is closed under alternation. From the definition of regular expressions, we know that the set of regular expressions is closed under alternation, under concatenation, and under Kleene closure.

These three properties play a critical role in the use of regular expressions to build scanners. Given a regular expression for each of the syntactic categories allowed in the source language, the alternation of those expressions is itself a regular expression that describes the set of all valid words in the language. The fact that regular expressions are closed under alternation assures us that the result is a regular expression. Anything that we can do to the simple regular expression for a single syntactic category will be equally applicable to the regular expression for the entire set of words in the language.
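Closure under alternation is exactly what scanner generators exploit: the category patterns are or-ed together into one pattern that classifies a word. A small illustration (the category names and patterns are invented for the example, not from the text):

```python
import re

# One pattern per syntactic category; the alternation of all of them is
# itself a single pattern describing every valid word.
CATEGORIES = {
    "keyword":    r"if|then|else",
    "identifier": r"[a-z][a-z0-9]*",
    "integer":    r"0|[1-9][0-9]*",
}
# Named groups record which alternative matched; order gives keywords
# priority over identifiers.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pat})" for name, pat in CATEGORIES.items())
)

def classify(word):
    m = COMBINED.fullmatch(word)
    return m.lastgroup if m else "error"

assert classify("then") == "keyword"
assert classify("x1") == "identifier"
assert classify("42") == "integer"
```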

Closure under concatenation allows us to build complex regular expressions from simpler ones by concatenating them together. This property seems both obvious and unimportant. However, it lets us piece together res in systematic ways. Closure ensures that rs is a re as long as both r and s are res. Thus, any techniques that can be applied to either r or s can be applied to rs; this includes constructions that automatically generate recognizer implementations from res.

The closure property for Kleene closure (or ∗) allows us to specify particular kinds of infinite sets with a finite pattern. This is critical; infinite patterns are of little use to an implementor. Since the Algol-identifier rule does not limit the length of the name, the rule admits an infinite set of words. Of course, no program can have identifiers of infinite length. However, a programmer might write identifiers of arbitrary, but finite, length. Regular expressions allow us to write concise rules for such a set without specifying a maximum length.

The closure properties for res introduce a level of abstraction into the construction of micro-syntactic rules. The compiler writer can define basic notions, like alpha and digit, and use them in other res. The re for Algol identifiers

alpha (alpha | digit)∗

is a re precisely because of the closure properties. It uses all three operators to combine smaller res into a more complex specification. The closure properties ensure that the tools for manipulating res are equally capable of operating on these “composite” res. Since the tools include scanner generators, this issue plays a direct role in building a compiler.

2.4 Regular Expressions and Finite Automata

Every regular expression, r, corresponds to an abstract machine that recognizes L(r). We call these recognizers finite automata. For example, in a lexical analyzer for assembly code, we might find the regular expression

register → r digit digit∗


The recognizer for this regular expression could be represented, pictorially, as a transition diagram:

[s0 moves to s1 on the character ‘r’; s1 moves to s2 on any digit; s2, drawn with a double circle, loops back to itself on any digit.]

The diagram incorporates all the information necessary to understand the recognizer or to implement it. Each circle represents a state; by convention, the recognizer starts in state s0. Each arrow represents a transition; the label on the transition specifies the set of inputs that cause the recognizer to make that transition. Thus, if the recognizer was in state s0 when it encountered the input character r, it would make the transition to state s1. In state s2, any digit takes the recognizer back to s2. The state s2 is designated as a final state, drawn with the double circle.

Formally, this recognizer is an example of a finite automaton (fa). An fa is represented, mathematically, as a five-tuple (Q, Σ, δ, q0, F ). Q is the set of states in the automaton, represented by circles in the diagram. Σ is the alphabet of characters that can legally occur in the input stream. Typically, Σ is the union of the edge labels in the diagram. Both Q and Σ must be finite. δ : Q × Σ → Q is the transition function for the fa. It encodes the state changes induced by an input character for each state; δ is represented in the diagram by the labeled edges that connect states. The state q0 ∈ Q is the starting state or initial state of the fa. F ⊆ Q is the set of states that are considered final or accepting states. States in F are recognizable in the diagram because they are drawn with a double circle.

Under these definitions, we can see that our drawings of fas are really pictograms. From the drawing, we can infer Q, Σ, δ, q0 and F .

The fa accepts an input string x if and only if x takes it through a series of transitions that leave it in a final state when x has been completely consumed. For an input string ‘r29’, the recognizer begins in s0. On r, it moves to s1. On 2, it moves to s2. On 9, it moves to s2. At that point, all the characters have been consumed and the recognizer is in a final state. Thus, it accepts r29 as a word in L(register).
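The acceptance procedure just described is easy to simulate directly; a sketch of the register-name fa in Python (state names follow the diagram, with the error convention of rejecting on any unspecified input):

```python
# States: s0 (start), s1 (after 'r'), s2 (final, after at least one digit).
DIGITS = set("0123456789")

def accepts(word):
    state = "s0"
    for ch in word:
        if state == "s0" and ch == "r":
            state = "s1"
        elif state in ("s1", "s2") and ch in DIGITS:
            state = "s2"
        else:
            return False          # implicit error state
    return state == "s2"          # s2 is the only accepting state

assert accepts("r29")
assert accepts("r999")            # the re admits arbitrary digit strings
assert not accepts("s29")
assert not accepts("r")           # consumed without reaching a final state
```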

More formally, we say that the fa (Q, Σ, δ, q0, F ) accepts the string x, composed of characters x1x2x3 . . . xn if and only if

δ(δ(. . . δ(δ(δ(q0, x1), x2), x3) . . . , xn−1), xn) ∈ F .

Intuitively, x1x2x3 . . . xn corresponds to the labels on a path through the fa’s transition diagram, beginning with q0 and ending in some qf ∈ F . At each step, xi corresponds to the label on the ith edge in the path.

In this more formal model, the fa for register names can be written as


Q = {s0, s1, s2}
Σ = {r, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
δ = {(s0, r) → s1, (s1, digit) → s2, (s2, digit) → s2}
q0 = s0
F = {s2}

Notice how this encodes the diagram. Q contains the states. Σ contains the edge labels. δ encodes the edges.

Error Transitions What happens if we confront the recognizer for register names with the string ‘s29’? State s0 has no transition for ‘s’. We will assume that any unspecified input leads to an error state se. To make the recognizer work, se needs a transition back to itself on every character in Σ.

[The augmented diagram adds se below the original states: every unspecified input from s0, s1, or s2 leads to se, and se has a transition back to itself on every character in Σ.]

With these additions, if the recognizer reaches se, it remains in se and consumes all of its input. By convention, we will assume the presence of an error state and the corresponding transitions in every fa, unless explicitly stated. We will not, however, crowd the diagrams by drawing them.
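The error-state convention is straightforward to encode: represent δ as a partial table whose missing entries default to se. A sketch (driver and table names are mine):

```python
# Run a dfa given as a partial transition table; any (state, char) pair
# not in the table falls into the error state "se", which absorbs the
# rest of the input.
def run(table, start, finals, word):
    state = start
    for ch in word:
        state = table.get((state, ch), "se")
    return state in finals

# Transition table for the register-name recognizer.
TABLE = {("s0", "r"): "s1"}
for d in "0123456789":
    TABLE[("s1", d)] = "s2"
    TABLE[("s2", d)] = "s2"

assert run(TABLE, "s0", {"s2"}, "r29")
assert not run(TABLE, "s0", {"s2"}, "s29")   # falls into se and stays there
```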

A More Precise Recognizer Our simple recognizer can distinguish between r29 and s29. It does, however, have limits. For example, it will accept r999; few computers have been built that have 999 registers! The flaw, however, lies in the regular expression, not in the recognizer. The regular expression specifies the numerical substring as digit digit∗; this admits arbitrary strings of digits. If the target machine had just thirty-two registers, named r0 through r31, a carefully crafted regular expression might be used, such as:

r ((0 | 1 | 2) (digit | ε) | (4 | 5 | 6 | 7 | 8 | 9) | (3 | 30 | 31))

This expression is more complex than our original expression. It will, however, accept r0, r9, and r29, but not r999. The fa for this expression has more states than the fa for the simpler regular expression.


[The dfa for this re: s0 moves to s1 on ‘r’; s1 moves to s2 on 0, 1, or 2, and s2 moves to s3 on any digit; s1 moves to s5 on 3, and s5 moves to s6 on 0 or 1; s1 moves to s4 on 4 through 9. States s2 through s6 are final.]

Recall, however, that the fa makes one transition on each input character. Thus, the new fa will make the same number of transitions as the original, even though it checks a more complex micro-syntactic specification. This is a critical property: the cost of operating a fa is proportional to the length of the input string, not to the length or complexity of the regular expression that generated the recognizer. Increased complexity in the regular expression can increase the number of states in the corresponding fa. This, in turn, increases the space needed to represent the fa and the cost of automatically constructing it, but the cost of operation is still one transition per input character.

Of course, the re for register names is fairly complex and counter-intuitive. An alternative re might be

r0  | r1  | r2  | r3  | r4  | r5  | r6  | r7
| r8  | r9  | r10 | r11 | r12 | r13 | r14 | r15
| r16 | r17 | r18 | r19 | r20 | r21 | r22 | r23
| r24 | r25 | r26 | r27 | r28 | r29 | r30 | r31
| r00 | r01 | r02 | r03 | r04 | r05 | r06 | r07
| r08 | r09

This expression is conceptually simpler, but much longer than the previous version. Ideally, we would like for both to produce equivalent automata. If we can develop the technology to produce the same dfa from both these descriptions, it might make sense to write the longer re since its correctness is far more obvious.
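Whether the two descriptions really specify the same set can be checked by brute force; a quick sketch using Python regular expressions as stand-ins for the res (the patterns are my transcription, not the text's):

```python
import re

# The compact re for r0..r31 (plus r00..r09) and the explicit enumeration.
compact = re.compile(r"r((0|1|2)[0-9]?|[4-9]|3|30|31)$")
explicit = re.compile(
    "(" + "|".join(["r%d" % i for i in range(32)]
                   + ["r0%d" % i for i in range(10)]) + ")$"
)

# Both should accept exactly the same strings of the form r<number>.
for n in range(200):
    s = "r%d" % n
    assert bool(compact.match(s)) == bool(explicit.match(s)), s
```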

2.5 Implementing a DFA

Dfas are of interest to the compiler writer because they lead to efficient implementations. For example, we can implement the dfa for

r digit digit∗

with a straightforward code skeleton and a set of tables that encode the information contained in the drawing of the dfa. Remember that s0 is the start state; that s2 is the sole final state; and that se is an error state. We will assume the existence of a function T : state → {start,normal,final,error} to let the recognizer switch into case logic based on the current state, and encode the transitions into a function δ : state × character → state that implements the transition function. These can both be encoded into tables.


char ← next character
state ← s0
call action(state, char)
while (char ≠ eof)
    state ← δ(state, char)
    call action(state, char)
    char ← next character
if T[state] = final then
    report acceptance
else
    report failure

action(state, char)
    switch (T[state])
        case start:
            word ← char
            break
        case normal:
            word ← word + char
            break
        case final:
            word ← word + char
            break
        case error:
            print error message
            break
    end

Figure 2.3: A Skeleton Recognizer for “r digit digit∗”

δ        r     0,1,2,3,4     other
               5,6,7,8,9
s0       s1    se            se
s1       se    s2            se
s2       se    s2            se
se       se    se            se

T        action
s0       start
s1       normal
s2       final
se       error

Notice that we have compacted the tables to let them fit on the page. We denote this by listing multiple labels at the head of the column. This suggests the kind of compaction that can be achieved with relatively simple schemes. Of course, to use the compressed table, we must translate the input character into a table index with another table lookup.

The tables are interpreted to implement the recognizer; the code to accomplish this is shown in Figure 2.3. The code is quite simple. The code that precedes the while loop implements the recognizer’s actions for state s0. The other states are handled by the case statement in the routine action. (We pulled this out into a separate routine to make the structure of the skeleton parser more apparent.) The while loop iterates until all input is consumed; for each input character, it uses δ to select an appropriate transition, or next state. When it reaches the end of the input string, symbolized by eof , it determines if its last state is an accepting state by checking T [state].
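The same table-driven scheme carries over directly to a high-level language; a Python sketch of the skeleton (δ as a dictionary whose missing entries default to se, T as the state classifier — names follow the text, details are mine):

```python
# Transition table δ for the register-name dfa.
DELTA = {("s0", "r"): "s1"}
for d in "0123456789":
    DELTA[("s1", d)] = "s2"
    DELTA[("s2", d)] = "s2"
# T classifies each state so the driver can pick an action.
T = {"s0": "start", "s1": "normal", "s2": "final", "se": "error"}

def recognize(word):
    state, lexeme = "s0", ""
    for ch in word:
        state = DELTA.get((state, ch), "se")
        if T[state] != "error":
            lexeme += ch              # the action: accumulate the word
    return (T[state] == "final"), lexeme

assert recognize("r29") == (True, "r29")
assert recognize("r") == (False, "r")    # ends in a non-final state
assert recognize("s29") == (False, "")
```

To implement a second dfa, only DELTA and T change; the driver loop stays the same, which is the point of the table-driven design.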

To implement a second dfa, we can simply change the tables used with the skeleton recognizer. In the previous section, we presented the following refinement to the register specification.

r ((0 | 1 | 2) (digit | ε) | (4 | 5 | 6 | 7 | 8 | 9) | (3 | 30 | 31))

It restricts register names to the set r0 through r31. The tables in Figure 2.4 encode the recognizer that we showed for this regular expression. Thus, to
