Beginning Regular Expressions 2005.pdf

Chapter 4

Try It Out

The . Metacharacter Matching a Newline Character

1. Open the Komodo development environment, and click the button for the Komodo Regular Expressions Toolkit.

2. Clear any regular expression and test string in the toolkit.

3. Check the Global check box and the Single-Line Mode check box.

4. Click in the Enter a String to Match Against area. Press the Return key once. This causes the first character in the test area to be a newline character.

5. Enter the . metacharacter in the Enter a Regular Expression area, and inspect the results.

There is pale green highlighting on the first (newline) character in the test text area. The gray area below the Enter a String to Match Against area should read Match succeeded: 0 groups.

How It Works

When the Global and Single-Line Mode check boxes are checked, the . metacharacter matches a newline character as well as the characters it normally matches. (Modifiers are discussed in more detail later in this chapter.) Therefore, when the regular expression engine begins matching at the position before the initial newline character, its first attempt to match succeeds.
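If you do not have Komodo at hand, the same behavior can be sketched in Python, whose re module calls single-line mode re.DOTALL. The test string below is an assumption made for illustration; it simply begins with a newline, as in the steps above.

```python
import re

# Hypothetical test string whose first character is a newline.
text = "\nfirst line"

# Without DOTALL, "." cannot match the newline, so the search moves on
# and matches the first ordinary character instead.
default_match = re.search(r".", text)

# With DOTALL (single-line mode), "." matches the newline at position 0.
dotall_match = re.search(r".", text, flags=re.DOTALL)

print(repr(default_match.group()))  # 'f'
print(repr(dotall_match.group()))   # '\n'
```

The only difference between the two searches is the flag, which is what the Single-Line Mode check box toggles in the toolkit.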

Because the period has very broad scope, it risks matching unintended characters, particularly when it is followed by the * or + quantifier, both of which allow unlimited numbers of potentially matching characters. In many situations, a regular expression engine will match “greedily,” meaning that it will match as many characters as possible. Patterns such as .* and .+ can match many paragraphs or pages of text, which may not be what you intend.
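A short Python sketch makes the risk concrete. The sample strings are invented for this illustration and are not taken from the chapter's sample files.

```python
import re

# Hypothetical sample text with three similar sentences.
text = "Part A: in stock. Part B: out of stock. Part C: in stock."

# ".+" is greedy: it consumes as much as it can while still allowing the
# trailing "stock" to match, so the match runs to the LAST "stock".
greedy = re.search(r"Part.+stock", text)
print(greedy.group())

# With DOTALL, ".*" also crosses newlines and can swallow every paragraph.
pages = "paragraph one\n\nparagraph two\n\nparagraph three"
whole = re.search(r".*", pages, flags=re.DOTALL)
print(whole.group() == pages)  # True
```

You might have expected the first search to stop at the end of the first sentence; instead it spans all three.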

Having looked at what the . metacharacter does, let’s return to the parts inventory problem briefly touched on at the beginning of this chapter.

Matching Variably Structured Part Numbers

The problem definition is as follows:

Match part numbers where the fourth character is an uppercase C and the fifth and sixth characters are numeric digits.

Whether the . metacharacter is an ideal component of a regular expression pattern depends, in part, on the structure of the data. If the data is as shown in the sample file Inventory.txt, you can use the following pattern to satisfy the problem definition

...C[0-9][0-9]

(three periods followed by an uppercase C, followed by two numeric digits), which is equivalent to the following:

.{3}C[0-9][0-9]

I have used the character class [0-9] for numeric digits because this example is tested using OpenOffice.org Writer, which does not support the \d metacharacter to match a numeric digit.
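To check that the two forms of the pattern behave identically, you can try both on a few values. The following Python sketch uses made-up part numbers, not the actual contents of Inventory.txt, and keeps [0-9] to mirror the pattern in the text even though Python also supports \d.

```python
import re

# Hypothetical part numbers chosen for illustration only.
part_numbers = ["D99C44", "CODD29", "E12C07", "AB1C2X"]

dots = re.compile(r"...C[0-9][0-9]")      # three literal dots in the pattern
counted = re.compile(r".{3}C[0-9][0-9]")  # the same thing with a quantifier

# fullmatch requires the whole six-character value to fit the pattern.
matches_dots = [p for p in part_numbers if dots.fullmatch(p)]
matches_counted = [p for p in part_numbers if counted.fullmatch(p)]

print(matches_dots)                     # ['D99C44', 'E12C07']
print(matches_dots == matches_counted)  # True
```

Only values with an uppercase C in fourth position followed by two digits are kept, which is what the problem definition asks for.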


Metacharacters and Modifiers

Try It Out

Using the . Metacharacter to Match Inventory

1. Open OpenOffice.org Writer, and open the sample file Inventory.txt.

2. Use Ctrl+F to open the Find and Replace dialog box.

3. Check the Regular Expressions and Match Case check boxes, and enter the pattern ...C[0-9][0-9] in the Search For text box.

4. Click the Find All button to display all matches in highlighted text, and inspect the results, as shown in Figure 4-3. Notice that the second part number is not matched.

Figure 4-3

How It Works

Look at why the pattern ...C[0-9][0-9] matches the part number D99C44 but fails to match the part number CODD29. In the descriptions that follow, I refer to part numbers, but strictly speaking, the regular expression engine matches a sequence of characters because it has no knowledge of what is or is not a part number.

Assuming that the regular expression engine is at the position immediately before the initial D of D99C44, it first attempts to match the . metacharacter with the D. That matches. Next, it attempts to match the second . metacharacter. Because the second character of the part number is 9, the . metacharacter matches. Similarly, the third . metacharacter matches the second 9. The fourth character in the regular expression pattern is an uppercase C. That matches the fourth character of the part number, which is C.

Next, the regular expression engine attempts to match a numeric digit. Because the first 4 of the sequence of characters D99C44 matches the pattern [0-9], there is a match for the fifth character. Finally, an attempt is made to match the second [0-9], which matches because the sixth character is a numeric digit, 4. Because all components of the regular expression pattern match, the pattern as a whole matches. The text is therefore highlighted in OpenOffice.org Writer.

If the regular expression engine is at the position immediately before the initial C of CODD29, it first attempts to match the first . metacharacter with the initial C of CODD29. That matches. Next, it attempts to match the second . metacharacter with the O of CODD29. That also matches. Then it attempts to match the third . metacharacter with the third character in CODD29, the first D. That also matches. Next, it attempts to match the uppercase C with the second D of CODD29. That does not match. Because one part of the pattern has failed to match, the whole pattern fails to match. Assuming that you clicked the Find All button, the regular expression engine then attempts to find further matches later in the test document.
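The component-by-component reasoning above can be imitated with a small Python sketch. It is a teaching aid under simplifying assumptions, not a description of how any real engine is implemented: it handles only a single attempt from the first character, and it works only because every component of this pattern consumes exactly one character. The trace function and the components list are hypothetical names introduced for this illustration.

```python
import re

# One sub-pattern per component of ...C[0-9][0-9] (each consumes one character).
components = [".", ".", ".", "C", "[0-9]", "[0-9]"]

def trace(candidate):
    """Return (component, character, matched) per step, stopping at the first failure."""
    steps = []
    for component, char in zip(components, candidate):
        ok = re.fullmatch(component, char) is not None
        steps.append((component, char, ok))
        if not ok:
            break
    return steps

for part in ("D99C44", "CODD29"):
    print(part)
    for component, char, ok in trace(part):
        print(f"  {component!r} against {char!r}: {'matches' if ok else 'fails'}")
```

For D99C44 all six steps succeed; for CODD29 the trace stops at the fourth step, where the literal C is compared with D.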

Matching a Literal Period

Given the existence of the . metacharacter, you cannot use a period as a literal character in a pattern to selectively match a period in a target document. To match a period in a target document, you must escape the period using a backslash:

\.

Try It Out

Matching a Literal Period Character

1. Open the Komodo development environment, and click the button to open the Komodo Regular Expression Toolkit.

2. Clear any residual test text and regular expression.

3. In the Enter a String to Match Against area enter the following: This sentence has a period at the end. We will try to match it.

4. In the Enter a Regular Expression area, enter the pattern \. and inspect the results, as shown in Figure 4-4.

How It Works

The regular expression engine starts at the position before the uppercase T of This and attempts to match each character in turn against the pattern \.. The first character that matches is the period that follows the word end.
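The same test can be reproduced in Python as a rough check. The sketch below uses the test sentence from step 3 and contrasts the escaped period with the unescaped . metacharacter.

```python
import re

text = "This sentence has a period at the end. We will try to match it."

# The escaped period matches only a literal "." in the text.
first = re.search(r"\.", text)
print(first.start() == text.index("."))  # True: it is the period after "end"

# Every literal period in the text, by position.
all_positions = [m.start() for m in re.finditer(r"\.", text)]
print(len(all_positions))  # 2

# Unescaped, "." is the metacharacter and matches the very first character.
unescaped = re.search(r".", text)
print(unescaped.group())  # T
```

Note the raw string prefix (r"..."), which keeps Python itself from interpreting the backslash before the pattern reaches the regular expression engine.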

As you have seen, the . metacharacter matches an extremely wide range of characters. The following sections look at metacharacters that allow a little more specificity: those that match only ASCII alphabetic characters (upper- and lowercase A through Z) and those that match only numeric digits.

Figure 4-4

The \w Metacharacter

The \w metacharacter matches only characters in the English alphabet, plus numeric digits and the underscore character. Thus, it differs from the . metacharacter because it does not match symbols; punctuation; or, in some implementations, alphabetic characters from languages other than English.

In some settings, the \w metacharacter is interpreted in the context of Unicode rather than ASCII. In those cases, the matching is wider than described in the preceding paragraph.
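Python is one such setting: for ordinary (str) patterns, \w is Unicode-aware by default, and the re.ASCII flag restricts it to the narrower ASCII interpretation. The sample text in this sketch is invented for illustration.

```python
import re

# Hypothetical sample text containing a non-English letter (U+00EF) and a symbol.
text = "na\u00efve_user42 costs \u20ac5"

# Default for str patterns: Unicode-aware, so the accented letter counts as \w.
unicode_words = re.findall(r"\w+", text)

# re.ASCII limits \w to A-Z, a-z, 0-9, and the underscore.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # ['naïve_user42', 'costs', '5']
print(ascii_words)    # ['na', 've_user42', 'costs', '5']
```

In both interpretations the underscore and digits are matched, while the space and the currency symbol are not.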

Try It Out

Matching Using the \w Metacharacter

1. Open the Komodo Regular Expression Toolkit, and clear any residual regular expression and test text.

2. In the Enter a String to Match Against area, type This sentence has a period at the end.

3. In the Enter a Regular Expression area, enter the regular expression \w{3}.

4. Inspect the results in the Enter a String to Match Against area and in the gray area below it. The three characters Thi should be highlighted (in pale green, if you’re looking at it on-screen).

Figure 4-5 shows the results of this step.
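As a cross-check outside Komodo, here is a minimal Python sketch of the same steps, using the test sentence and pattern given above.

```python
import re

text = "This sentence has a period at the end."

# The first run of three consecutive word characters.
first = re.search(r"\w{3}", text)
print(first.group())  # Thi

# All non-overlapping runs of exactly three word characters, left to right.
all_matches = re.findall(r"\w{3}", text)
print(all_matches)
```

The fourth character, s, is left over: it is followed by a space, which \w does not match, so no second three-character match can begin there.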
