Cooper K., Engineering a Compiler

Chapter 2

Lexical Analysis

2.1 Introduction

The scanner takes as input a stream of characters and produces as output a stream of words, along with their associated syntactic categories. It aggregates letters together to form words and applies a set of rules to determine whether or not the word is legal in the source language and, if so, its syntactic category. This task can be done quickly and efficiently using a specialized recognizer.

This chapter describes the mathematical tools and the programming techniques that are commonly used to perform lexical analysis. Most of the work in scanner construction has been automated; indeed, this is a classic example of the application of theoretical results to solve an important and practical problem—specifying and recognizing patterns. The problem has a natural mathematical formulation. The mathematics leads directly to efficient implementation schemes. The compiler writer specifies the lexical structure of the language using a concise notation and the tools transform that specification into an efficient executable program. These techniques have led directly to useful tools in other settings, like the Unix tool grep and the regular-expression pattern matching found in many text editors and word-processing tools. Scanning is, essentially, a solved problem.

Scanners look at a stream of characters and recognize words. The rules that govern the lexical structure of a programming language, sometimes called its micro-syntax, are simple and regular. This leads to highly efficient, specialized recognizers for scanning. Typically, a compiler’s front end has a scanner to handle its micro-syntax and a parser for its context-free syntax, which is more complex to recognize. This setup is shown in Figure 2.1. Separating micro-syntax from syntax simplifies the compiler-writer’s life in three ways.

The description of syntax used in the parser is written in terms of words and syntactic categories, rather than letters, numbers, and blanks. This lets the parser ignore irrelevant issues like absorbing extraneous blanks, newlines, and comments. These are hidden inside the scanner, where they are handled cleanly and efficiently.

[Figure 2.1 shows a block diagram of the front end: source code flows into the scanner, which passes words to the parser, which emits intermediate code.]

Figure 2.1: Structure of a typical front end

Scanner construction is almost completely automated. The lexical rules are encoded in a formal notation and fed to a scanner generator. The result is an executable program that produces the input for the parser. Scanners generated from high-level specifications are quite efficient.

Every rule moved into the scanner shrinks the parser. Parsing is harder than scanning; the amount of code in a parser grows as the grammar grows. Since parser construction requires more direct intervention from the programmer, shrinking the parser reduces the compiler-writer’s effort.

As a final point, well-implemented scanners have lower overhead (measured by instructions executed per input symbol) than well-implemented parsers. Thus, moving work into the scanner improves the performance of the entire front end.

Our goal for this chapter is to develop the notations for specifying lexical patterns and the techniques to convert those patterns directly into an executable scanner. Figure 2.2 depicts this scenario. This technology should allow the compiler writer to specify the lexical properties at a reasonably high level and leave the detailed work of implementation to the scanner generator—without sacrificing efficiency in the final product, the compiler.

First, we will introduce a notation, called regular expressions, that works well for specifying lexical patterns. We will explore the properties of regular expressions and their relationship to a particular kind of recognizer, called a finite automaton. Next, we will develop the techniques and methods that allow us to automate the construction of efficient scanners from regular expressions. We show some additional results that relate regular expressions and automata, and conclude with an example of how complex the task of lexical analysis can be in Fortran, a language designed before we had a good understanding of the mathematics of lexical analysis.

2.2 Specifying Lexical Patterns

Before we can build a scanner, we need a way of specifying the micro-syntax of the source language—of specifying patterns that describe the words in the language. Some parts are quite easy.

Punctuation marks like colons, semi-colons, commas, parentheses, and square brackets can be denoted by their unique character representations:

[Figure 2.2 shows automatic scanner generation: lexical patterns are fed to a scanner generator, which produces the scanner that sits in the front end (source code → scanner → words → parser → intermediate code).]

Figure 2.2: Automatic scanner generation

: ; , ( ) [ ]

Keywords, like if, then, and integer, are equally simple. These words have a unique spelling, so we can represent them as literal patterns—we simply write them down.

Some simple concepts have more complicated representations. For example, the concept of a blank might require a small grammar.

WhiteSpace → WhiteSpace blank
           | WhiteSpace tab
           | blank
           | tab

where blank and tab have the obvious meanings.

Thus, for many words in the source language, we already have a concise representation—we can simply write down the words themselves. Other parts of a programming language are much harder to specify. For these, we will write down rules that can generate all of the appropriate strings.

Consider, for example, a pattern to specify when a string of characters forms a number. An integer might be described as a string of one or more digits, beginning with a digit other than zero, or as the single digit zero. A decimal number is simply an integer, followed by a decimal point, followed by a string of zero or more digits. (Notice that the part to the left of the decimal point cannot have leading zeros unless it is a zero, while the fractional part must be able to have leading zeros.) Real numbers and complex numbers are even more complicated. Introducing optional signs (+ or -) adds yet more clauses to the rules.
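The integer and decimal descriptions above translate naturally into patterns. A minimal sketch using Python's re module (the pattern names and the exact regex syntax are my own transcription, not from the text):

```python
import re

# An integer is "0" or a nonzero digit followed by any digits;
# a decimal is an integer, a point, then zero or more digits.
# An optional sign adds one more clause to each rule.
INTEGER = re.compile(r"[+-]?(0|[1-9][0-9]*)$")
DECIMAL = re.compile(r"[+-]?(0|[1-9][0-9]*)\.[0-9]*$")

assert INTEGER.match("0")
assert INTEGER.match("-42")
assert not INTEGER.match("007")    # leading zeros rejected
assert DECIMAL.match("3.14")
assert DECIMAL.match("10.00")      # the fractional part may have leading zeros
assert not DECIMAL.match("01.5")
```

Note how the leading-zero restriction falls out of the alternation `0|[1-9][0-9]*`, exactly as in the prose description.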

The rules for an identifier are usually somewhat simpler than those for a number. For example, Algol allowed identifier names that consisted of a single alphabetic character, followed by zero or more alphanumeric characters. Many languages include special characters such as the ampersand (&), percent sign (%), and underscore (_) in the alphabet for identifier names.


The complexity of lexical analysis arises from the simple fact that several of the syntactic categories used in a typical programming language contain an effectively infinite set of words. In particular, both numbers and identifiers usually contain sets large enough to make enumeration impractical.1 To simplify scanner construction, we need to introduce a powerful notation to specify these lexical rules: regular expressions.

A regular expression describes a set of strings over the characters contained in some alphabet, Σ, augmented with a character, ε, that represents the empty string. We call the set of strings a language. For a given regular expression, r, we denote the language that it specifies as L(r). A regular expression is built up from three simple operations:

Union – the union of two sets R and S, denoted R ∪ S, is the set {s | s ∈ R or s ∈ S}.

Concatenation – the concatenation of two sets R and S, denoted RS, is the set {st | s ∈ R and t ∈ S}. We will sometimes write R² for RR, the concatenation of R with itself, and R³ for RRR (or RR²).

Closure – the Kleene closure of a set R, denoted R∗, is the set R⁰ ∪ R¹ ∪ R² ∪ · · ·. This is just the concatenation of R with itself, zero or more times.

It is sometimes convenient to talk about the positive closure of R, denoted R⁺. It is defined as R¹ ∪ R² ∪ R³ ∪ · · ·, or RR∗.

Using these three operations, we can define the set of regular expressions (res) over an alphabet Σ.

1. if a ∈ Σ, then a is also a re denoting the set containing only a.

2. if r and s are res, denoting sets L(r) and L(s) respectively, then

(r) is a re denoting L(r)

r | s is a re denoting the union of L(r) and L(s)

rs is a re denoting the concatenation of L(r) and L(s)

r∗ is a re denoting the Kleene closure of L(r).

3. ε is a re denoting the set containing only the empty string.

To disambiguate these expressions, we assume that closure has highest precedence, followed by concatenation, followed by union. (Union is sometimes called alternation.)

As a simple example, consider our clumsy English description of the microsyntax of Algol identifiers. As a re, an identifier might be defined as:

1This is not always the case. Dartmouth Basic, an interpreted language from the early 1960’s, only allowed variable names that began with an alphabetic character and had at most one digit following that character. This limited the programmer to two hundred and eighty six variable names. Some implementations simplified the translation by statically mapping each name to one of two hundred and eighty six memory locations.


alpha → (a | b | c | d | e | f | g | h | i | j | k | l | m
         | n | o | p | q | r | s | t | u | v | w | x | y | z)

digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)

identifier → alpha (alpha | digit)∗

Here, we introduced names for the subexpressions alpha and digit. The entire expression could have been written on one line (with a suitably small font). Where the meaning is clear, we will elide some of the enumerated elements in a definition. Thus, we might write alpha → (a | b | c | · · · | z) as a shorthand for the definition of alpha given previously. This allows us to write res more concisely. For example, the Algol-identifier specification now becomes

(a | b | c | · · · | z) ( (a | b | c | · · · | z) | (0 | 1 | 2 | · · · | 9) )∗
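As a sanity check, the Algol-identifier re translates almost directly into a Python regular expression; a sketch assuming lower-case ASCII letters only:

```python
import re

# Character classes standing in for the enumerated alternations above.
ALPHA = "[a-z]"
DIGIT = "[0-9]"
# alpha (alpha | digit)*  — a letter followed by letters or digits.
IDENTIFIER = re.compile(f"{ALPHA}({ALPHA}|{DIGIT})*$")

assert IDENTIFIER.match("x")
assert IDENTIFIER.match("sum2")
assert not IDENTIFIER.match("2sum")   # must begin with a letter
```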

A similar set of rules can be built up to describe the various kinds of numbers.

integer → (+ | − | ε) (0 | 1 | 2 | · · · | 9)⁺

decimal → integer . (0 | 1 | 2 | · · · | 9)∗

real → (integer | decimal) E integer

In the re real, the letter E is a delimiter that separates the mantissa from the exponent. (Some programming languages use other letters to denote specific internal formats for floating point numbers.)

Notice that the specification for an integer admits an arbitrary number of leading zeros. We can refine the specification to avoid leading zeros, except for the single, standalone zero required to write the number zero.

integer → (+ | − | ε) (0 | (1 | 2 | 3 | · · · | 9) (0 | 1 | 2 | · · · | 9)∗)

Unfortunately, the rule for real relied on leading zeros in its exponent, so we must also rewrite that rule as

real → (integer | decimal) E 0∗ integer, or

real → (integer | decimal) E (0 | 1 | 2 | · · · | 9)⁺

Of course, even more complex examples can be built using the three operations of regular expressions—union, concatenation, and closure.
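One way to transcribe the refined rules is again with Python's re module; this is a hedged sketch (the names INT, DEC, and REAL are mine), using the variant of real whose exponent is an integer without leading zeros:

```python
import re

# Refined integer: optional sign, then "0" or a nonzero-led digit string.
INT = r"[+-]?(0|[1-9][0-9]*)"
# Decimal: an integer, a point, zero or more digits.
DEC = INT + r"\.[0-9]*"
# Real: (integer | decimal) E integer.
REAL = re.compile(f"({INT}|{DEC})E{INT}$")

assert REAL.match("1.5E-3")
assert REAL.match("2E10")
assert not REAL.match("2E007")   # leading zeros in the exponent rejected
```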

2.3 Closure Properties of REs

The languages generated by regular expressions have been studied extensively. They have many important properties; some of those properties play an important role in the use of regular expressions to build scanners.

Regular expressions are closed under many operations. For example, given two regular expressions r and s over an alphabet Σ, we know that (r | s) is a regular expression that represents the language


{w | w ∈ L(r) or w ∈ L(s)}.

This follows from the definition of regular expressions. Because (r | s) is, by definition, a regular expression, we say that the set of regular expressions is closed under alternation. From the definition of regular expressions, we know that the set of regular expressions is closed under alternation, under concatenation, and under Kleene closure.

These three properties play a critical role in the use of regular expressions to build scanners. Given a regular expression for each of the syntactic categories allowed in the source language, the alternation of those expressions is itself a regular expression that describes the set of all valid words in the language. The fact that regular expressions are closed under alternation assures us that the result is a regular expression. Anything that we can do to the simple regular expression for a single syntactic category will be equally applicable to the regular expression for the entire set of words in the language.
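Closure under alternation is exactly what scanner generators exploit: the category patterns are or-ed together into one pattern that classifies a word. A small illustration (the category names and patterns are invented for the example, not from the text):

```python
import re

# One pattern per syntactic category; the alternation of all of them is
# itself a single pattern describing every valid word.
CATEGORIES = {
    "keyword":    r"if|then|else",
    "identifier": r"[a-z][a-z0-9]*",
    "integer":    r"0|[1-9][0-9]*",
}
# Named groups record which alternative matched; order gives keywords
# priority over identifiers.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pat})" for name, pat in CATEGORIES.items())
)

def classify(word):
    m = COMBINED.fullmatch(word)
    return m.lastgroup if m else "error"

assert classify("then") == "keyword"
assert classify("x1") == "identifier"
assert classify("42") == "integer"
```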

Closure under concatenation allows us to build complex regular expressions from simpler ones by concatenating them together. This property seems both obvious and unimportant. However, it lets us piece together res in systematic ways. Closure ensures that rs is a re as long as both r and s are res. Thus, any techniques that can be applied to either r or s can be applied to rs; this includes constructions that automatically generate recognizer implementations from res.

The closure property for Kleene closure (or ∗) allows us to specify particular kinds of infinite sets with a finite pattern. This is critical; infinite patterns are of little use to an implementor. Since the Algol-identifier rule does not limit the length of the name, the rule admits an infinite set of words. Of course, no program can have identifiers of infinite length. However, a programmer might write identifiers of arbitrary, but finite, length. Regular expressions allow us to write concise rules for such a set without specifying a maximum length.

The closure properties for res introduce a level of abstraction into the construction of micro-syntactic rules. The compiler writer can define basic notions, like alpha and digit, and use them in other res. The re for Algol identifiers

alpha (alpha | digit)∗

is a re precisely because of the closure properties. It uses all three operators to combine smaller res into a more complex specification. The closure properties ensure that the tools for manipulating res are equally capable of operating on these “composite” res. Since the tools include scanner generators, this issue plays a direct role in building a compiler.

2.4 Regular Expressions and Finite Automata

Every regular expression, r, corresponds to an abstract machine that recognizes L(r). We call these recognizers finite automata. For example, in a lexical analyzer for assembly code, we might find the regular expression

register → r digit digit∗


The recognizer for this regular expression could be represented, pictorially, as a transition diagram:

[s0 moves to s1 on the character ‘r’; s1 moves to s2 on any digit; s2, drawn with a double circle, loops back to itself on any digit.]

The diagram incorporates all the information necessary to understand the recognizer or to implement it. Each circle represents a state; by convention, the recognizer starts in state s0. Each arrow represents a transition; the label on the transition specifies the set of inputs that cause the recognizer to make that transition. Thus, if the recognizer was in state s0 when it encountered the input character r, it would make the transition to state s1. In state s2, any digit takes the recognizer back to s2. The state s2 is designated as a final state, drawn with the double circle.

Formally, this recognizer is an example of a finite automaton (fa). An fa is represented, mathematically, as a five-tuple (Q, Σ, δ, q0, F ). Q is the set of states in the automaton, represented by circles in the diagram. Σ is the alphabet of characters that can legally occur in the input stream. Typically, Σ is the union of the edge labels in the diagram. Both Q and Σ must be finite. δ : Q × Σ → Q is the transition function for the fa. It encodes the state changes induced by an input character for each state; δ is represented in the diagram by the labeled edges that connect states. The state q0 ∈ Q is the starting state or initial state of the fa. F ⊆ Q is the set of states that are considered final or accepting states. States in F are recognizable in the diagram because they are drawn with a double circle.

Under these definitions, we can see that our drawings of fas are really pictograms. From the drawing, we can infer Q, Σ, δ, q0 and F .

The fa accepts an input string x if and only if x takes it through a series of transitions that leave it in a final state when x has been completely consumed. For an input string ‘r29’, the recognizer begins in s0. On r, it moves to s1. On 2, it moves to s2. On 9, it moves to s2. At that point, all the characters have been consumed and the recognizer is in a final state. Thus, it accepts r29 as a word in L(register).
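The acceptance procedure just described is easy to simulate directly; a sketch of the register-name fa in Python (state names follow the diagram, with the error convention of rejecting on any unspecified input):

```python
# States: s0 (start), s1 (after 'r'), s2 (final, after at least one digit).
DIGITS = set("0123456789")

def accepts(word):
    state = "s0"
    for ch in word:
        if state == "s0" and ch == "r":
            state = "s1"
        elif state in ("s1", "s2") and ch in DIGITS:
            state = "s2"
        else:
            return False          # implicit error state
    return state == "s2"          # s2 is the only accepting state

assert accepts("r29")
assert accepts("r999")            # the re admits arbitrary digit strings
assert not accepts("s29")
assert not accepts("r")           # consumed without reaching a final state
```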

More formally, we say that the fa (Q, Σ, δ, q0, F ) accepts the string x, composed of characters x1x2x3 . . . xn if and only if

δ(δ(. . . δ(δ(δ(q0, x1), x2), x3) . . . , xn−1), xn) ∈ F .

Intuitively, x1x2x3 . . . xn corresponds to the labels on a path through the fa’s transition diagram, beginning with q0 and ending in some qf ∈ F . At each step, xi corresponds to the label on the ith edge in the path.

In this more formal model, the fa for register names can be written as


Q = {s0, s1, s2}
Σ = {r, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
δ = {(s0, r) → s1, (s1, digit) → s2, (s2, digit) → s2}
q0 = s0
F = {s2}

Notice how this encodes the diagram. Q contains the states. Σ contains the edge labels. δ encodes the edges.

Error Transitions What happens if we confront the recognizer for register names with the string ‘s29’? State s0 has no transition for ‘s’. We will assume that any unspecified input leads to an error state se. To make the recognizer work, se needs a transition back to itself on every character in Σ.

[The augmented diagram adds se below the original states: every unspecified input from s0, s1, or s2 leads to se, and se has a transition back to itself on every character in Σ.]

With these additions, if the recognizer reaches se, it remains in se and consumes all of its input. By convention, we will assume the presence of an error state and the corresponding transitions in every fa, unless explicitly stated. We will not, however, crowd the diagrams by drawing them.
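The error-state convention is straightforward to encode: represent δ as a partial table whose missing entries default to se. A sketch (driver and table names are mine):

```python
# Run a dfa given as a partial transition table; any (state, char) pair
# not in the table falls into the error state "se", which absorbs the
# rest of the input.
def run(table, start, finals, word):
    state = start
    for ch in word:
        state = table.get((state, ch), "se")
    return state in finals

# Transition table for the register-name recognizer.
TABLE = {("s0", "r"): "s1"}
for d in "0123456789":
    TABLE[("s1", d)] = "s2"
    TABLE[("s2", d)] = "s2"

assert run(TABLE, "s0", {"s2"}, "r29")
assert not run(TABLE, "s0", {"s2"}, "s29")   # falls into se and stays there
```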

A More Precise Recognizer Our simple recognizer can distinguish between r29 and s29. It does, however, have limits. For example, it will accept r999; few computers have been built that have 999 registers! The flaw, however, lies in the regular expression, not in the recognizer. The regular expression specifies the numerical substring as digit digit∗; this admits arbitrary strings of digits. If the target machine had just thirty-two registers, named r0 through r31, a carefully crafted regular expression might be used, such as:

r ((0 | 1 | 2) (digit | ε) | (4 | 5 | 6 | 7 | 8 | 9) | (3 | 30 | 31))

This expression is more complex than our original expression. It will, however, accept r0, r9, and r29, but not r999. The fa for this expression has more states than the fa for the simpler regular expression.


[The dfa for this re: s0 moves to s1 on ‘r’; s1 moves to s2 on 0, 1, or 2, and s2 moves to s3 on any digit; s1 moves to s5 on 3, and s5 moves to s6 on 0 or 1; s1 moves to s4 on 4 through 9. States s2 through s6 are final.]

Recall, however, that the fa makes one transition on each input character. Thus, the new fa will make the same number of transitions as the original, even though it checks a more complex micro-syntactic specification. This is a critical property: the cost of operating a fa is proportional to the length of the input string, not to the length or complexity of the regular expression that generated the recognizer. Increased complexity in the regular expression can increase the number of states in the corresponding fa. This, in turn, increases the space needed to represent the fa and the cost of automatically constructing it, but the cost of operation is still one transition per input character.

Of course, the re for register names is fairly complex and counter-intuitive. An alternative re might be

r0  | r1  | r2  | r3  | r4  | r5  | r6  | r7
| r8  | r9  | r10 | r11 | r12 | r13 | r14 | r15
| r16 | r17 | r18 | r19 | r20 | r21 | r22 | r23
| r24 | r25 | r26 | r27 | r28 | r29 | r30 | r31
| r00 | r01 | r02 | r03 | r04 | r05 | r06 | r07
| r08 | r09

This expression is conceptually simpler, but much longer than the previous version. Ideally, we would like for both to produce equivalent automata. If we can develop the technology to produce the same dfa from both these descriptions, it might make sense to write the longer re since its correctness is far more obvious.
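Whether the two descriptions really specify the same set can be checked by brute force; a quick sketch using Python regular expressions as stand-ins for the res (the patterns are my transcription, not the text's):

```python
import re

# The compact re for r0..r31 (plus r00..r09) and the explicit enumeration.
compact = re.compile(r"r((0|1|2)[0-9]?|[4-9]|3|30|31)$")
explicit = re.compile(
    "(" + "|".join(["r%d" % i for i in range(32)]
                   + ["r0%d" % i for i in range(10)]) + ")$"
)

# Both should accept exactly the same strings of the form r<number>.
for n in range(200):
    s = "r%d" % n
    assert bool(compact.match(s)) == bool(explicit.match(s)), s
```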

2.5 Implementing a DFA

Dfas are of interest to the compiler writer because they lead to efficient implementations. For example, we can implement the dfa for

r digit digit∗

with a straightforward code skeleton and a set of tables that encode the information contained in the drawing of the dfa. Remember that s0 is the start state; that s2 is the sole final state; and that se is an error state. We will assume the existence of a function T : state → {start,normal,final,error} to let the recognizer switch into case logic based on the current state, and encode the transitions into a function δ : state × character → state that implements the transition function. These can both be encoded into tables.


char ← next character
state ← s0
call action(state, char)
while (char ≠ eof)
    state ← δ(state, char)
    call action(state, char)
    char ← next character
if T[state] = final then
    report acceptance
else
    report failure

action(state, char)
    switch (T[state])
        case start:
            word ← char
            break
        case normal:
            word ← word + char
            break
        case final:
            word ← word + char
            break
        case error:
            print error message
            break
    end

Figure 2.3: A Skeleton Recognizer for “r digit digit∗”

δ        r     0,1,2,3,4     other
               5,6,7,8,9
s0       s1    se            se
s1       se    s2            se
s2       se    s2            se
se       se    se            se

T        action
s0       start
s1       normal
s2       final
se       error

Notice that we have compacted the tables to let them fit on the page. We denote this by listing multiple labels at the head of the column. This suggests the kind of compaction that can be achieved with relatively simple schemes. Of course, to use the compressed table, we must translate the input character into a table index with another table lookup.

The tables are interpreted to implement the recognizer; the code to accomplish this is shown in Figure 2.3. The code is quite simple. The code that precedes the while loop implements the recognizer’s actions for state s0. The other states are handled by the case statement in the routine action. (We pulled this out into a separate routine to make the structure of the skeleton parser more apparent.) The while loop iterates until all input is consumed; for each input character, it uses δ to select an appropriate transition, or next state. When it reaches the end of the input string, symbolized by eof , it determines if its last state is an accepting state by checking T [state].
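The same table-driven scheme carries over directly to a high-level language; a Python sketch of the skeleton (δ as a dictionary whose missing entries default to se, T as the state classifier — names follow the text, details are mine):

```python
# Transition table δ for the register-name dfa.
DELTA = {("s0", "r"): "s1"}
for d in "0123456789":
    DELTA[("s1", d)] = "s2"
    DELTA[("s2", d)] = "s2"
# T classifies each state so the driver can pick an action.
T = {"s0": "start", "s1": "normal", "s2": "final", "se": "error"}

def recognize(word):
    state, lexeme = "s0", ""
    for ch in word:
        state = DELTA.get((state, ch), "se")
        if T[state] != "error":
            lexeme += ch              # the action: accumulate the word
    return (T[state] == "final"), lexeme

assert recognize("r29") == (True, "r29")
assert recognize("r") == (False, "r")    # ends in a non-final state
assert recognize("s29") == (False, "")
```

To implement a second dfa, only DELTA and T change; the driver loop stays the same, which is the point of the table-driven design.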

To implement a second dfa, we can simply change the tables used with the skeleton recognizer. In the previous section, we presented the following refinement to the register specification.

r ((0 | 1 | 2) (digit | ε) | (4 | 5 | 6 | 7 | 8 | 9) | (3 | 30 | 31))

It restricts register names to the set r0 through r31. The tables in Figure 2.4 encode the recognizer that we showed for this regular expression. Thus, to
