Beginning Regular Expressions 2005.pdf

Chapter 7

How It Works

First, consider some aspects of the situation after Step 4.

The text Doctor in Line 1 matches because the pattern Doctor is the first option inside the parentheses. Each literal character in the pattern matches the corresponding literal character in the test text.

The text Dr in Line 3 matches because the pattern Dr, the third option in the parentheses, matches the test text character for character.

On Line 4, the sequence of three characters, Dr., matches.

The Dr of Drive is matched because it matches the third option inside the parentheses.

Now let’s look at the change in matches after adding the end-of-word boundary metacharacter after the closing parenthesis.

Previously, the Dr of Drf on Line 2 matched. Now there is no match: although the Dr of Drf is still matched by the third option inside the parentheses, the following character is an f, so the position between the r and the f is not an end-of-word boundary.

On Line 4, the match is now only two characters long. The first option inside the parentheses does not match, so matching is attempted using the remaining options. The longer sequence Dr., including the period, can no longer be the match, because the position after the period does not follow a word character and so is not an end-of-word boundary. The option Dr matches the first and second characters on Line 4. The literal period on Line 4 is not a word character, so the position following the r of Dr is an end-of-word boundary. The boundary occurs because of the period that follows the r, not because of the space character that follows the period.
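The effect of the boundary can be reproduced outside OpenOffice.org Writer. The following is a minimal sketch in Python, resting on several assumptions: the pattern (Doctor|Dr\.|Dr) and the test lines are reconstructed from the discussion above rather than quoted from the exercise, the name Smith on the fourth line is purely illustrative, and Python's \b stands in for the end-of-word metacharacter. Python's \b matches at the start of a word as well as at the end, so it is only an approximation that happens to behave the same way on these lines.

```python
import re

# Assumed test lines, reconstructed from the discussion; "Smith" is illustrative.
lines = ["Doctor", "Drf", "Dr", "Dr. Smith", "Drive"]

# Assumed pattern: Doctor first, Dr with a literal period second, Dr third.
without_boundary = re.compile(r"(Doctor|Dr\.|Dr)")
# Same options followed by a word boundary (\b approximates end-of-word here).
with_boundary = re.compile(r"(Doctor|Dr\.|Dr)\b")


def first_match(pattern, text):
    """Return the text of the first match in `text`, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None


for line in lines:
    print(repr(line),
          first_match(without_boundary, line),
          first_match(with_boundary, line))
```

On the fourth line the option with the literal period is tried first and matches Dr., but the position between the period and the space is not a word boundary, so the engine backs up and accepts the shorter option Dr instead.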

Unexpected Alternation Behavior

When using alternation, you may sometimes observe behavior that you don’t expect. This is particularly likely to happen if you have options of unequal length with the shorter option on the left and with the shorter option included in the longer option. That may sound confusing, so let’s look at an example.

Suppose that you want to match either the single lowercase character a or the lowercase character sequence ab. You could express that desire in the following pattern:

(a|ab)

Notice that the shorter option, a, is on the left and that it is also part of the longer option, ab. So the conditions described at the beginning of this section are both satisfied.

The sample file, ab.txt, is shown here:

a

ab

ac ab

ba

bab


Parentheses in Regular Expressions

Notice that on three lines, the sequence of characters ab is present.

Try It Out

Unequal Alternation

First, try it out, using the following pattern:

(a|ab)

1. Open the file ab.txt in OpenOffice.org Writer.

2. Open the Find & Replace dialog box by pressing Ctrl+F, and check the Regular Expressions and Match Case check boxes.

3. Enter the regular expression pattern (a|ab) in the Search For text box.

4. Click the Find All button, and inspect the results, as shown in Figure 7-6.

Figure 7-6

Notice that each of the highlighted matches is only a single character in length and matches the lowercase character a.


5. Now reverse the alternation options and use the following pattern:

(ab|a)

Edit the pattern in the Search For text box to read (ab|a).

6. Click the Find All button, and inspect the results in Figure 7-7. Notice that there are now three matches that are two characters in length and that match the sequence of lowercase characters ab.

Figure 7-7
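The two searches can also be compared in code. The following is a minimal sketch using Python's re module rather than OpenOffice.org Writer; like the behavior shown in the figures, Python tries the options of an alternation from left to right. The helper name all_matches is made up for this illustration, and the list holds the contents of ab.txt as listed earlier.

```python
import re

# Contents of ab.txt as listed in the text, one entry per line.
sample_lines = ["a", "ab", "ac ab", "ba", "bab"]


def all_matches(pattern, lines):
    """Return the matched text of every match on every line, in order."""
    return [m.group(0) for line in lines for m in re.finditer(pattern, line)]


short_first = all_matches(r"(a|ab)", sample_lines)
long_first = all_matches(r"(ab|a)", sample_lines)

print(short_first)  # ['a', 'a', 'a', 'a', 'a', 'a']
print(long_first)   # ['a', 'ab', 'a', 'ab', 'a', 'ab']
```

With the shorter option first, every match is a single a. With the longer option first, the three lines that contain ab each produce a two-character match.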

How It Works

On Line 2, using the pattern (a|ab), there is a match, but only of a single character. Assume that the regular expression engine starts at the position immediately before the a. It attempts to match the first option, a, against the first character on the line, which is also an a. That character matches, and because the first option consists of only that single character, the option as a whole matches. The regular expression engine therefore doesn't attempt to match the second option. It moves to the position after the match it has found (the position between a and b) and attempts to match again, but neither option, a or ab, matches the character b, so there is no further match.

Used on its own, the pattern (a|ab) never matches the sequence of characters ab: wherever ab could begin, the first option matches the a first, so the second option is never tried.
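One way to avoid this trap when a list of options is assembled in code is to place longer options ahead of shorter ones. The following is a minimal sketch in Python; build_alternation is a hypothetical helper written for this illustration, not part of any library, and it assumes that the options are literal strings rather than patterns.

```python
import re


def build_alternation(options):
    """Build a grouped alternation with the longest literal options first."""
    # Longest first, so a short option cannot shadow a longer one that starts with it.
    ordered = sorted(options, key=len, reverse=True)
    # re.escape treats each option as literal text.
    return "(" + "|".join(re.escape(opt) for opt in ordered) + ")"


pattern = build_alternation(["a", "ab"])
print(pattern)  # (ab|a)

# Spans of the matches found on the line "ab" with each ordering.
spans_short_first = [m.span() for m in re.finditer(r"(a|ab)", "ab")]
spans_long_first = [m.span() for m in re.finditer(pattern, "ab")]

print(spans_short_first)  # [(0, 1)]
print(spans_long_first)   # [(0, 2)]
```

The spans show the same thing as the walkthrough: with the shorter option first, the only match covers the a, whereas with the longer option first the match covers both characters.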
