Beginning Regular Expressions 2005.pdf

5

Character Classes

Character classes are used when you want to match any one of a collection of characters. The need for character classes may, for example, occur when you are matching certain parts in a parts catalog or certain names in an employee listing.

Some character classes correspond to widely used collections of characters. For example, the character class [A-Z] matches any one uppercase ASCII letter, and the character class [0-9] matches any one numeric digit.

This chapter looks at the following:

How character classes work

How to use quantifiers with character classes

How to use ranges inside character classes

How to use negated character classes

Introduction to Character Classes

A character class is an unordered grouping of characters from which one character is chosen to provide a match for a regular expression pattern. If none of the characters specified in the character class matches the character currently being matched, the match fails.

The following pattern containing the character class [yi] would match the surnames Smith and Smyth because the third character of each of those sequences of characters is contained in the specified character class.

Sm[yi]th

When a character class has no associated quantifier, the pattern specifies that exactly one character from the character class is to be matched. So the following pattern would match pear, peer, and peir but would not match per, because the character that follows pe in per is r, which is not one of the characters contained in the character class:

pe[aei]r
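The two patterns above can be tried out with a short script. This is a minimal sketch that assumes Python's re module rather than the tools used in this chapter; the extra test words Smath and peaer are illustrative additions, not part of the original examples.

```python
import re

# Each character class stands for exactly one character at that position.
smith = re.compile(r"Sm[yi]th")
pear = re.compile(r"pe[aei]r")

# fullmatch requires the whole word to match the pattern.
smith_hits = [w for w in ["Smith", "Smyth", "Smath"] if smith.fullmatch(w)]
pear_hits = [w for w in ["pear", "peer", "peir", "per", "peaer"] if pear.fullmatch(w)]

print(smith_hits)  # ['Smith', 'Smyth']
print(pear_hits)   # ['pear', 'peer', 'peir']
```

The word per fails because nothing stands between pe and r, and peaer fails because the class consumes only one character when no quantifier follows it.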

The term character set is sometimes used to refer to the notion for which I use the term character class. The term character class seems to be more widely used and is the one I use consistently in this chapter and elsewhere in this book.

Examine the following problem definition:

Match an uppercase A, followed by an uppercase B, followed by either the numeric digit 1 or the numeric digit 2, followed by another numeric digit.

To select part numbers AB10 to AB29 given this definition, you could use the following pattern:

AB[12][0123456789]

The first character class, [12], indicates that the third character in a sequence of characters can be the numeric digit 1 or the numeric digit 2. The character class [0123456789] indicates that the fourth character in a sequence of characters must be a numeric digit, 0 through 9.

The sample data is in the file ABPartNumbers.txt:

AB31

AB2D

AB10

AB18

AB44

AB29

AB24
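As a quick check of the pattern against this sample data, here is a minimal sketch that assumes Python's re module instead of OpenOffice.org Writer. The sample string reproduces the listing above; Python matching is case sensitive by default, which corresponds to checking Match Case.

```python
import re

# Contents of ABPartNumbers.txt as listed above, one part number per line.
sample = "AB31\nAB2D\nAB10\nAB18\nAB44\nAB29\nAB24\n"

pattern = re.compile(r"AB[12][0123456789]")

# findall returns every non-overlapping match in document order.
matches = pattern.findall(sample)
print(matches)  # ['AB10', 'AB18', 'AB29', 'AB24']
```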

Try It Out Character Class

1. Open OpenOffice.org Writer, and open the test file ABPartNumbers.txt.

2. Use Ctrl+F to open the Find & Replace dialog box.

3. Check the Regular Expressions and Match Case check boxes.

4. Enter the pattern AB[12][0123456789] in the Search For text box, and click the Find All button.

5. Inspect the sample text, shown in Figure 5-1, to see which sequences of characters have been highlighted. Notice that neither of the first two sequences of characters is matched.

Figure 5-1

How It Works

The regular expression engine begins matching at the position immediately before the A of AB31. It attempts to match the uppercase A in the pattern against the uppercase A in the sample text. There is a match. It next attempts to match the second character in the pattern, uppercase B, against the next character in the sample text, which is an uppercase B. That too matches. Next, it attempts to match the third component of the pattern (which is the character class [12] rather than a single literal character) against the third character of the sequence, the numeric digit 3. There is no match. Because one component of the pattern fails to match, the entire pattern fails to match.

The sequence of characters AB2D also fails to match. The first two characters in the sequence match against the first two characters, AB, in the pattern. The third character in the sequence of characters, 2, matches against the character class [12]. However, the fourth character in the sequence of characters, D, does not match against the character class [0123456789]. Because one component of the pattern fails to match, the entire pattern fails to match.

However, the sequence of characters AB10 does match. The first character in the sequence of characters, A, matches the first character in the pattern, A. The second character in the sequence of characters, B, matches the second character in the pattern, B. The third character of the sequence of characters, the numeric digit 1, matches the third component of the pattern, the character class [12], because the numeric digit 1 is contained in the character class. The fourth character in the sequence of characters, the numeric digit 0, matches because the numeric digit 0 is contained in the character class [0123456789].
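The component-by-component walk-through above can be imitated in a few lines. This is only an illustrative sketch in Python: the helper first_failure and the COMPONENTS list are hypothetical names invented here, and a real regular expression engine does not work by splitting the pattern this way.

```python
import re

# Hypothetical decomposition of AB[12][0123456789] into its four components.
COMPONENTS = ["A", "B", "[12]", "[0123456789]"]

def first_failure(text):
    """Return the 1-based index of the first component that fails, or None."""
    for i, component in enumerate(COMPONENTS):
        # A component fails if the text has run out or the character
        # at this position does not match it.
        if i >= len(text) or not re.fullmatch(component, text[i]):
            return i + 1
    return None

print(first_failure("AB31"))  # 3
print(first_failure("AB2D"))  # 4
print(first_failure("AB10"))  # None
```

For AB31 the third component, [12], is the first to fail; for AB2D it is the fourth, [0123456789]; for AB10 every component matches.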

Choice between Two Characters

You can use a character class for a choice as simple as that between two characters. However, for that scenario you can just as easily use alternation, enclosing the two options in parentheses and separating them with a vertical bar.

Parentheses and how they can be used in alternation are described in more detail in Chapter 7.

Suppose that you want to select people in a listing represented by the sample document People.txt, shown here:

Cardoza, Fred

Catto, Philipa

Duncan, Jean

Edwards, Neil

England, Elizabeth

Main, Robert

Martin, Jane

Meens, Carol

Patrick, Harry

Paul, Jeanine

Roberts, Clementine

Schmidt, Paul

Sells, Simon

Smith, Peter

Stephens, Sheila

Wales, Gareth

Zinni, Hamish

Assume that all names are laid out as shown, on separate lines, and that the surname is first, followed by a comma, then a space, then the first name. If you wanted to select people whose surname begins with C or D, you could use the following problem definition:

Match an uppercase C or an uppercase D, followed by one or more successive ASCII lowercase alphabetic characters.

The following pattern could be used to express a solution to the problem definition:

[CD][a-z]+

However, that pattern is not specific enough. If you use it to test Roberts, Clementine you will find that there is an undesired match in the first name Clementine, and you want to match last names. So you need a more specific pattern. In this case, you can simply modify the problem definition to the following:

Match an uppercase C or an uppercase D, followed by one or more successive ASCII lowercase alphabetic characters, followed by a comma.

This results in a more specific pattern:

[CD][a-z]+,

An alternative approach is to use parentheses to express the problem definition with the same results, as shown here:

(C|D)[a-z]+,
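The three patterns discussed so far can be compared side by side. This is a minimal sketch that assumes Python's re module; the list reproduces People.txt as shown earlier. Because (C|D) is a capturing group, the sketch uses finditer and group(0) to retrieve the whole match rather than findall, which would return only the captured letter.

```python
import re

# Contents of People.txt as listed above.
people = "\n".join([
    "Cardoza, Fred", "Catto, Philipa", "Duncan, Jean", "Edwards, Neil",
    "England, Elizabeth", "Main, Robert", "Martin, Jane", "Meens, Carol",
    "Patrick, Harry", "Paul, Jeanine", "Roberts, Clementine", "Schmidt, Paul",
    "Sells, Simon", "Smith, Peter", "Stephens, Sheila", "Wales, Gareth",
    "Zinni, Hamish",
])

def whole_matches(pattern, text):
    """Return the full text of every match, ignoring capturing groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]

loose = whole_matches(r"[CD][a-z]+", people)
strict = whole_matches(r"[CD][a-z]+,", people)
alternation = whole_matches(r"(C|D)[a-z]+,", people)

print(loose)        # ['Cardoza', 'Catto', 'Duncan', 'Carol', 'Clementine']
print(strict)       # ['Cardoza,', 'Catto,', 'Duncan,']
print(alternation)  # ['Cardoza,', 'Catto,', 'Duncan,']
```

Without the trailing comma the first names Carol and Clementine are picked up as well; with it, only the three surnames match, and the character class and the alternation give identical results.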

Now try it out.

Try It Out Selecting Specified Surnames

1. Open OpenOffice.org Writer, and open the test file People.txt.

2. Use the Ctrl+F keyboard shortcut to open the Find & Replace dialog box.

3. Check the Regular Expressions and Match Case check boxes.

4. Enter the pattern [CD][a-z]+, in the Search For text box, and click the Find All button.

5. Inspect the results. As Figure 5-2 shows, all three names in People.txt whose surname begins with C or D are selected. Notice that with the comma included in the regular expression pattern, the lines Meens, Carol and Roberts, Clementine do not produce a match.

Figure 5-2

6. Delete the final comma from the regular expression pattern.

7. Click the Find All button, and inspect the matches. Notice that now, when the final comma is removed, the first names Carol (in Meens, Carol) and Clementine (in Roberts, Clementine) are also matched.

How It Works

When the regular expression engine begins to match, it starts at the position before the initial C of Cardoza, Fred. It attempts to match the first component of the pattern, the character class [CD], against the first character of the test text, an uppercase C. There is a match. Next, it attempts to match the second component of the pattern, the pattern [a-z]+ (meaning one or more lowercase ASCII characters), against the second and subsequent characters of the test text. Each of the characters a, r, d, o, z, and a matches. The final comma does not match the pattern [a-z]+ but does match the final component of the regular expression pattern, which is a literal comma. So there is a match for each of the components of the regular expression. The uppercase C matches [CD], the sequence of lowercase characters ardoza matches [a-z]+, and the final comma in the test text matches the comma in the regular expression pattern.

When the regular expression engine comes to the position before the C of Carol in the test text Meens, Carol, it attempts to match the [CD] character class as before against the uppercase C. That matches. The following test text, arol, matches the pattern [a-z]+. However, there is no match for the final comma in the regular expression pattern, so there is no match for the entire pattern.

A character class is very flexible and can be changed or extended as needed. For example, you could extend the selection to include people whose surname begins with C or D or S by modifying the pattern as follows:

[CDS]\w+,

Of course, you could also write that using parentheses, as shown here:

(C|D|S)\w+,
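The extended character class can be checked in the same way. The sketch below assumes Python's re module and uses only the surnames from People.txt that begin with C, D, or S, plus one other for contrast. Note that \w is broader than [a-z]: it also matches uppercase letters, digits, and the underscore. The name DeSoto is a hypothetical example added here to show the difference.

```python
import re

# Excerpt of People.txt as listed earlier.
excerpt = "\n".join([
    "Cardoza, Fred", "Catto, Philipa", "Duncan, Jean", "Meens, Carol",
    "Schmidt, Paul", "Sells, Simon", "Smith, Peter", "Stephens, Sheila",
])

surnames = re.findall(r"[CDS]\w+,", excerpt)
print(surnames)
# ['Cardoza,', 'Catto,', 'Duncan,', 'Schmidt,', 'Sells,', 'Smith,', 'Stephens,']

# Hypothetical mixed-case surname: \w+ accepts the inner uppercase S,
# whereas [a-z]+ does not, so the narrower pattern matches only 'Soto,'.
mixed = "DeSoto, Ann"
with_word_class = re.findall(r"[CDS]\w+,", mixed)
with_lowercase_class = re.findall(r"[CDS][a-z]+,", mixed)
print(with_word_class)       # ['DeSoto,']
print(with_lowercase_class)  # ['Soto,']
```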

In some situations only a single letter differs in correct spellings of words. One example is the spelling of grey (in British English) and gray (in U.S. English).

The problem definition can be expressed as follows:

Match a lowercase g, followed by a lowercase r, followed by a choice of either lowercase e or lowercase a, followed by lowercase y.

A pattern that expresses that problem definition follows:

gr[ae]y

It can also be written as follows:

gr[ea]y
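Because a character class is an unordered grouping, the two spellings of the pattern behave identically. A minimal sketch, assuming Python's re module; the test words griy, gry, and graey are illustrative additions.

```python
import re

words = ["grey", "gray", "griy", "gry", "graey"]

# The order of characters inside the brackets does not affect matching.
ae_hits = [w for w in words if re.fullmatch(r"gr[ae]y", w)]
ea_hits = [w for w in words if re.fullmatch(r"gr[ea]y", w)]

print(ae_hits)             # ['grey', 'gray']
print(ae_hits == ea_hits)  # True
```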
