Beginning Regular Expressions 2005.pdf

Chapter 4

Regular Expression Metacharacters

You saw in Chapter 3 how literal characters can be combined with quantifiers to create useful but fairly simple regular expression patterns. However, literal characters are pretty restrictive in what they match. Sometimes, it is desirable or necessary to allow more flexible matching. Several metacharacters match a class of characters rather than simply a single literal character. That wider scope can be very useful.

Many of the metacharacters referred to and demonstrated in this chapter consist of two characters. The term metasequence is sometimes used to refer to such pairs of characters that, taken together, convey the meaning of a metacharacter. I use the terms metacharacter and metasequence interchangeably.

For example, consider a parts inventory, Inventory.txt, such as the following:

D99C44

A9DC55

CODD29

RT2C23

MNZC55

UVCC83

Notice the variability in how the first three characters of the sample part numbers are structured. For example, the first part number has an alphabetic character followed by two numeric digits. However, the second part number has a single alphabetic character followed by a single numeric digit, followed by a single alphabetic character. The techniques you have used previously won’t allow you to specify a suitable regular expression pattern, because the structure of a part number is too variable to allow you to easily address the problem using literal characters in a regular expression pattern. The task you want to carry out is to achieve matches to correspond to the following problem definition:

Match part numbers where the fourth character is an uppercase C and the fifth and sixth characters are numeric digits.

If the data is simple, with a relatively small number of options for any individual character, it might be possible to provide a solution using the alternation techniques described in Chapter 7. However, for the purposes of this chapter, assume that the data is so varied that other techniques should be used.
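To make the problem definition concrete, here is a minimal Python sketch that checks the sample part numbers against it. The pattern shown is only one possible formulation and is an assumption for illustration, not necessarily the one this chapter arrives at; the names `part_numbers` and `PART_PATTERN` are hypothetical.

```python
import re

# Sample part numbers from Inventory.txt, as listed above.
part_numbers = ["D99C44", "A9DC55", "CODD29", "RT2C23", "MNZC55", "UVCC83"]

# Any three characters, then a literal uppercase C, then two numeric digits.
# Illustrative assumption only; other patterns satisfy the same definition.
PART_PATTERN = re.compile(r"...C[0-9][0-9]")

# fullmatch() requires the whole part number to fit the pattern.
matching = [p for p in part_numbers if PART_PATTERN.fullmatch(p)]
print(matching)
```

Only CODD29 is rejected, because its fourth character is D rather than C.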

Thinking about Characters and Positions

One of the important basic concepts that you need to grasp is the difference between a character and a position.

To make the distinction between a character and a position clear, look at the following sample text:

This is a simple sentence.


Metacharacters and Modifiers

The first character in the sample text is the uppercase T of This. However, there is a position immediately before the uppercase T. The position is not visible and does not match any of the literal characters discussed in Chapter 3. However, there are metacharacters that match a position, such as the ^ metacharacter, which matches the position immediately before the uppercase T in the sample text. Metacharacters that match positions rather than characters are introduced in detail in Chapter 6.

The second character in the sample text is the lowercase h of This. Between the initial uppercase T and the lowercase h, there is a position. Often, such positions between the letters of a sequence of characters (in other words, positions inside words) are not of specific interest to a developer. However, positions at the beginning of a string, at the end of a string, and at the beginning and end of a sequence of alphabetic characters are often of more interest to developers, which is why there are metacharacters that correspond to such positions. The so-called word-boundary metacharacters (strictly speaking, they match the boundaries of a sequence of alphabetic or alphanumeric characters) match a position between an alphabetic character and a nonalphabetic character. In many situations, those boundaries will correspond to the boundaries of a word. Those metacharacters are introduced in Chapter 6.
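The difference between a character and a position can be seen in a short Python sketch. It assumes Python's `re` module, in which `^` and `\b` behave as described here, except that `\b` also treats digits and the underscore as word characters. A position match consumes no characters, so the matched text is empty.

```python
import re

text = "This is a simple sentence."

# ^ matches the position before the uppercase T, not the T itself.
start = re.search(r"^", text)
start_span = start.span()    # start and end offsets are the same
start_text = start.group()   # the matched text is empty

# \b matches every position where a word begins or ends.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

print(start_span, repr(start_text))
print(boundaries)
```

Note that no boundary is reported between the T and the h of This, because both are word characters.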

Metacharacters that match classes of characters are also very useful, and it is those that this chapter tackles.

The Period (.) Metacharacter

The period is one of the most broadly scoped metacharacters. It can match any alphabetic character, whether lowercase or uppercase, as well as any numeric digit. This can be an advantage, because the . metacharacter will match almost anything, which can be useful if you aren’t too concerned about exactly what you match or how many matches you end up with. The disadvantage of the . metacharacter is the same — it will match almost anything. For example, in a search-and-replace operation, replacing the sequence of characters that match the . metacharacter can be very dangerous, with results similar to, but potentially wider in scope than, the replacement of startling by Moontling that you saw in the Star Training Company example in Chapter 1.
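The danger in a search-and-replace operation is easy to demonstrate. The following is a minimal sketch using Python's `re.subn()`, which returns the new string together with the number of replacements made; the test string is an arbitrary assumption.

```python
import re

text = "Andrew 42"

# Replace every match of the . metacharacter with a # character.
masked, count = re.subn(r".", "#", text)

print(masked)  # every letter, digit, and space has been replaced
print(count)   # one replacement per character in the string
```

Every character in the string is replaced, including the space, which is rarely what you intend.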

Try It Out

The Period (.) Metacharacter

Using the Komodo Regular Expression Toolkit, you can experiment with the period by entering alphabetic and numeric characters as test text. Remember that the Komodo Regular Expression Toolkit highlights only the first match of a pattern in the test text.

1. Open the Komodo development environment.

2. Click the button for the Komodo Regular Expression Toolkit, and clear any regular expression and test string in the toolkit.

3. Enter a test string in the Enter a String to Match Against area. The test string is Andrew.

4. Enter a period in the Enter a Regular Expression area of the toolkit, and inspect the result, which is displayed immediately below the Enter a String to Match Against area.

The result in this case is Match succeeded: 0 groups. The concept of groups is discussed in Chapter 7.

The . metacharacter matches any alphabetic character used in English, any numeric digit, whitespace characters such as the space character, and a very large number of alphabetic characters used in languages other than English. Figure 4-1 shows the . metacharacter in the Komodo Regular Expression Toolkit matching an uppercase A.


Figure 4-1

How It Works

When the . metacharacter occurs in a regular expression pattern, the regular expression engine attempts to match it against the current character of the test text. Any uppercase or lowercase English alphabetic character or any numeric digit will match, as will a very large number of non–English-language characters. By default, only a newline fails to match.

The regular expression engine begins attempting to find a match at the position immediately before the initial A of Andrew. The first character of the test text, A, is tested as a possible match for the . metacharacter. It matches. So the initial A is outlined in pale green, indicating that it is the first match.
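If you don't have Komodo to hand, the same experiment can be approximated in Python. This is a sketch using the standard `re` module rather than the toolkit itself; `re.search()` returns only the first match, and an empty result from `groups()` corresponds to the toolkit's report of 0 groups.

```python
import re

# The first match of . in the test string Andrew.
match = re.search(r".", "Andrew")

first_char = match.group()   # the text that matched
first_pos = match.start()    # where the match begins
group_count = len(match.groups())

# The . metacharacter also matches each numeric digit and the space character.
digits = re.findall(r".", "234")
space_matches = re.fullmatch(r".", " ") is not None

print(first_char, first_pos, group_count, digits, space_matches)
```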

The . metacharacter also matches alphabetic characters in languages other than English.

Try It Out

The . Metacharacter Matching Non-English Characters

If you have closed the Komodo Regular Expression Toolkit, work through all of the steps below. If you have kept the toolkit open, start at Step 2.

1. Open the Komodo development environment, and click the button for the Komodo Regular Expression Toolkit.

2. Clear any regular expression and/or test string in the toolkit.

3. Open the Windows Character Map. In Windows XP, you can do that by selecting Start > All Programs > Accessories > System Tools and, finally, selecting Character Map.

4. Click once on the scroll bar to the right of the Character Map window. Click the uppercase Ω character (omega), and you should see something similar to that shown in Figure 4-2.

5. With the uppercase Ω selected, click the Select button. The character should appear in the Character Map window’s Characters to Copy text box.


Figure 4-2

6. Click the Copy button in the Character Map window.

7. Enter a test string in the Enter a String to Match Against area of the Komodo Regular Expression Toolkit by clicking in the Enter a String to Match Against area and pressing Ctrl+V to paste. The test string is Ω.

8. Enter a period in the Enter a Regular Expression area of the toolkit, and inspect the result, which is displayed immediately below the Enter a String to Match Against area. Notice, too, that the uppercase omega is highlighted in pale green on-screen, indicating that it is a match for the . metacharacter.

How It Works

The regular expression engine attempts to match the . metacharacter against any character that is not a newline. An attempt at matching begins at the position immediately before the uppercase omega. The first character, the uppercase omega, matches the . metacharacter. Because the uppercase omega is a character that isn’t a newline, there is a match. Because the entire regular expression is matched (there is only a single metacharacter on this occasion), matching is complete and successful.
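The same behavior can be checked outside the toolkit. The sketch below assumes Python's `re` module, where the default behavior of the . metacharacter is also to match any character except a newline; the name `OMEGA` is introduced only for illustration.

```python
import re

OMEGA = "\u03a9"  # Greek uppercase letter omega

# The omega is not a newline, so the . metacharacter matches it.
omega_match = re.search(r".", OMEGA)

# A string consisting only of a newline gives the engine nothing to match.
newline_match = re.search(r".", "\n")

print(omega_match.group())
print(newline_match)
```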

Referring back to Figure 4-2, you can see the . metacharacter matching the Greek uppercase letter omega.

You can also try the . metacharacter with any numeric digit or sequence of numeric digits — for example, 234 — and you will see that the . metacharacter matches any numeric digit from 0 through 9.

Using the . metacharacter with any English text is very straightforward. In most circumstances, it will match anything except a newline. However, the matching characteristics of the . metacharacter can be modified to match a newline. In the Komodo Regular Expression Toolkit, this can be done using the single-line mode.
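As a rough illustration of what single-line mode changes, here is a sketch in Python, where the corresponding option is the `re.DOTALL` flag. Treating this flag as equivalent to the toolkit's single-line mode is an assumption on my part, and the two-line test string is made up for the example.

```python
import re

text = "line one\nline two"

# Default behavior: the newline between the two lines is skipped.
default_matches = re.findall(r".", text)

# With DOTALL, the . metacharacter matches the newline as well.
dotall_matches = re.findall(r".", text, flags=re.DOTALL)

print(len(default_matches), len(dotall_matches))
```

The second list is one item longer than the first, and the extra item is the newline character.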
