Beginning Regular Expressions 2005.pdf

 

 

Lookahead and Lookbehind

 

 

 

 

 

 

 

Metacharacter     Meaning

(?: ... )         Non-capturing grouping
(?= ... )         Positive lookahead
(?! ... )         Negative lookahead
(?<= ... )        Positive lookbehind
(?<! ... )        Negative lookbehind
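As a quick orientation, each of the five constructs listed above can be tried in Python's re module. This is a minimal sketch rather than part of the chapter's PowerGrep exercises; the sample string and the variable names are invented for illustration.

```python
import re

# Hypothetical sample text, invented for this sketch.
text = "price: $42, cost: 17 units"

# (?: ... ) groups without capturing, so the digits are the only group.
m = re.search(r"(?:price|cost): \$(\d+)", text)
non_capturing = m.group(1)

# (?= ... ) positive lookahead: digits that are followed by " units".
positive_ahead = re.findall(r"\d+(?= units)", text)

# (?! ... ) negative lookahead: whole numbers NOT followed by " units".
negative_ahead = re.findall(r"\b\d+\b(?! units)", text)

# (?<= ... ) positive lookbehind: digits that are preceded by "$".
positive_behind = re.findall(r"(?<=\$)\d+", text)

# (?<! ... ) negative lookbehind: whole numbers NOT preceded by "$".
negative_behind = re.findall(r"(?<!\$)\b\d+\b", text)

print(non_capturing)    # 42
print(positive_ahead)   # ['17']
print(negative_ahead)   # ['42']
print(positive_behind)  # ['42']
print(negative_behind)  # ['17']
```

The \b word boundaries in the negative cases stop the engine from backtracking to a shorter run of digits, which would otherwise satisfy the negative condition with a partial number.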

 

 

 

Lookahead

Lookahead allows you to make a match conditional on the matching sequence of characters being followed by (positive lookahead) or not being followed by (negative lookahead) a specified sequence of characters.

Some documentation (the .NET documentation is an example) refers to lookahead as a zero-width lookahead assertion. This chapter will generally use the shorter term, lookahead, with the same meaning. Where relevant, the terms positive and negative will be used as qualifiers of the term lookahead.

A key point to appreciate about lookahead is that the characters that occur later than the specified match are not consumed by the regular expression engine.

This can become quite abstract, so a simple example follows to help clarify what the regular expression engine is doing.

Suppose that you have a document containing part numbers of the form alphabetic character, alphabetic character, numeric digit, numeric digit, as follows:

BC99

However, you are interested only in the alphabetic characters. The problem definition can be expressed as follows:

Match two consecutive alphabetic characters if they occur at the beginning of a line and are immediately followed by two numeric digits.

The matching takes place only on the alphabetic characters. The pattern that implements this problem definition never matches any numeric digits. A pattern to implement the problem definition is as follows:

^[A-Za-z]{2}(?=\d\d)


Chapter 8

The ^ metacharacter indicates the beginning-of-line position. The [A-Za-z]{2} component matches two ASCII alphabetic characters. The (?=\d\d) pattern is the lookahead, which specifies that the two alphabetic characters must be immediately followed by two numeric digits.

A simple test file, PartNumbers.txt, is shown here:

AB21

AB1

CD8D3

RD/25

Only one of the four part numbers will match.
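If you do not have PowerGrep to hand, the same check can be sketched in Python. The test data is the content of PartNumbers.txt as listed above, held in a string rather than read from disk; the re.MULTILINE flag is an assumption about how to reproduce line-by-line matching in Python, where ^ otherwise matches only at the start of the whole string.

```python
import re

# Contents of PartNumbers.txt as listed in the text, one part number per line.
part_numbers = "AB21\nAB1\nCD8D3\nRD/25\n"

# re.MULTILINE makes ^ match at the start of every line, not only the string.
pattern = re.compile(r"^[A-Za-z]{2}(?=\d\d)", re.MULTILINE)

matches = [m.group() for m in pattern.finditer(part_numbers)]
print(matches)  # ['AB']
```

Only the AB of AB21 is returned. The digits 21 are required by the lookahead but are not part of the match.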

Try It Out

Part Numbers Example

1. Open PowerGrep, and in the Search Text area, enter the pattern ^[A-Za-z]{2}(?=\d\d).

2. In the Folder text box, enter the folder name C:\BRegExp\Ch08. Amend the folder name if you downloaded the example code to a different directory.

3. In the File Mask text box, enter PartNumbers.txt, and click the Search button.

4. Inspect the results in the Results area, as shown in Figure 8-1. Notice that only one of the four part numbers matches the pattern.

Figure 8-1

How It Works

Assume that the regular expression engine starts at the position before the A on the first line. It first attempts to match its current position against the ^ metacharacter. There is a match. Next, it attempts to match the first [A-Za-z] character class against the first character on the first line. The uppercase A matches the character class. Next, it attempts to match the character class again against the second character of the test text, B. There is a match. To that point in the processing, the regular expression engine has carried out a match in the way that you have seen previously. However, the pattern (?=\d\d) tells the regular expression engine that it must check what the following sequence of characters is and match the two alphabetic characters only if two numeric digits, as indicated by the \d\d specified inside the lookahead, are present. The first \d metacharacter matches the 2 of 21. The second \d metacharacter matches the 1 of 21. So there is a sequence of two numeric digits following the two alphabetic characters; the constraint imposed by the lookahead pattern, (?=\d\d), is satisfied. Therefore, the whole pattern matches.

When the regular expression engine reaches the position at the start of the second line, the test text on that line is AB1. The engine attempts to match the ^ metacharacter. Because the engine is at the beginning-of-line position, there is a match. The subsequent match attempts for the A and B of AB1 are successful. The consuming part of the pattern has now matched, so the lookahead is evaluated. The lookahead (?=\d\d) fails, because only one numeric digit follows. All the characters you wanted to match have been matched, but the lookahead has failed; therefore, the whole regular expression fails to match on that line.
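The point that lookahead does not consume characters can be observed directly by comparing match positions. The following Python sketch contrasts the lookahead pattern with a hypothetical variant that consumes the digits; the variable names are illustrative only.

```python
import re

line = "AB21"

# Lookahead version: the digits are checked but not consumed.
with_lookahead = re.match(r"^[A-Za-z]{2}(?=\d\d)", line)

# Consuming version (for comparison): the digits become part of the match.
without_lookahead = re.match(r"^[A-Za-z]{2}\d\d", line)

print(with_lookahead.group(), with_lookahead.span())        # AB (0, 2)
print(without_lookahead.group(), without_lookahead.span())  # AB21 (0, 4)

# Because the digits were not consumed, a later pattern can still match them
# starting from where the lookahead match ended.
remaining = re.compile(r"\d\d").match(line, with_lookahead.end())
print(remaining.group())  # 21

# On the second test line the lookahead fails, so there is no match at all.
print(re.match(r"^[A-Za-z]{2}(?=\d\d)", "AB1"))  # None
```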

Positive Lookahead

Positive lookahead is the process of matching a sequence of characters constrained by a requirement that the sequence must be followed by some other sequence of characters (which is usually different).

As with many other techniques in regular expressions, there is often more than one way to use lookahead with the same result. For example, to match the character sequence State only when it occurs in States, you could use:

(?=States)State

or

State(?=s)

The first option means “Find a position that is followed by the character sequence States. If such a position exists, attempt to match the character sequence State.” The second option means “Match the character sequence State only if it is followed by a lowercase s.” Both patterns will match the same character sequence.
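A short Python sketch can confirm that the two patterns select the same characters. The sample sentence is invented for illustration; it contains State once on its own and once inside States.

```python
import re

# Hypothetical sample text: "State" alone, then "State" inside "States".
text = "The State of Ohio is one of fifty States."

# Lookahead placed before the text to be matched.
first = [m.span() for m in re.finditer(r"(?=States)State", text)]

# Lookahead placed after the text to be matched.
second = [m.span() for m in re.finditer(r"State(?=s)", text)]

print(first == second)  # True
print(len(first))       # 1
```

Both patterns skip the standalone State and match only the State that begins States, without including the final s.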

Positive Lookahead — Star Training Example

Now that you understand how to use positive lookahead, let’s revisit the Star Training Company example and improve the specificity of the processing.

The most straightforward way to approach the problem, assuming for the moment that you will use only lookahead, is the following problem definition:

Match the sequence of characters, S, t, a, and r when they are followed by a single space character and then followed by the sequence of characters T, r, a, i, n, i, n, and g.

The following pattern implements the problem definition:

Star(?= Training)


Try It Out

Positive Lookahead — Star Training Example

1. Open PowerGrep, and type the pattern Star(?= Training) in the Search text area.

2. Type the folder name C:\BRegExp\Ch08 in the Folder text box. Amend the folder name if you downloaded the sample files for this chapter to a different location.

3. Type the filename StarOriginal.txt in the File Mask text box, and click the Search button.

4. Inspect the results in the Results area, as shown in Figure 8-2. Notice that all six occurrences of the character sequence Star that precede Training (with an intervening space character) are matched. Due to the way that PowerGrep displays text, you will likely need to scroll horizontally to see some of the matching character sequences.

Figure 8-2

How It Works

The regular expression engine starts at the beginning of the test document, attempting to match the character sequence Star. Matching of Star takes place in the normal way. However, each time the character sequence Star is matched, the regular expression engine also looks ahead to find out whether or not Star is followed by a space character and the character sequence Training.

The occurrence of Star on the first line is followed by the specified sequence, so there is a match.

However, in the second line, Star, which is part of Starting, is matched by the pattern Star, but matching fails when the lookahead is evaluated.
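The behavior described here can be sketched in Python. The contents of StarOriginal.txt are not reproduced in this section, so the sample text below is a hypothetical stand-in that contains Star followed by Training, Star as part of Starting, and Star followed by something else.

```python
import re

# Hypothetical stand-in for StarOriginal.txt; the real file is not shown here.
sample = (
    "Star Training Company offers courses.\n"
    "Starting next month, Star Training adds new classes.\n"
    "A movie Star visited the office.\n"
)

pattern = re.compile(r"Star(?= Training)")

for m in pattern.finditer(sample):
    # The match is only "Star"; show the text that follows it for context.
    print(m.group(), "->", repr(sample[m.end():m.end() + 9]))

hits = [m.group() for m in pattern.finditer(sample)]
print(len(hits))  # 2
```

The Star in Starting and the Star in "movie Star" are both rejected by the lookahead, and each successful match contains only the four characters of Star.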

Positive Lookahead — Later in Same Sentence

Positive lookahead can find occurrences of two words of interest in the same sentence. This section looks at how you can match a word if a second word occurs later in the same sentence. The assumption is made that the data does not contain a number that includes a decimal point (which is indistinguishable from a period character) and does not include an ellipsis made of a short sequence of period characters.


The test file, Sentence.txt, is shown here:

Here is a sentence where one can look ahead to interesting character sequences.

This sentence does not contain interesting characters.

Here is a sequence of characters.

Which sequence of characters is contained in this sentence?

The problem definition is as follows:

Match the character sequence sentence only when it is followed by the character sequence sequence in the same sentence.

Only one of the lines, the first, contains the sequence of characters sentence with the sequence of characters sequence later in the same sentence.

Try It Out

Positive Lookahead — Later in the Same Sentence

1. Open PowerGrep, and type the pattern sentence(?=.*sequence.*\.) in the Search text area.

2. Type the folder name C:\BRegExp\Ch08 in the Folder text box. Amend the folder name if you downloaded the test files for this chapter to another folder.

3. Type the filename Sentence.txt in the File Mask text box, and click the Search button.

4. Inspect the results displayed in the Results area, as shown in Figure 8-3.

Figure 8-3

How It Works

The regular expression engine looks for the character sequence sentence. If it finds that character sequence, it tests whether the lookahead condition is satisfied. The lookahead pattern, (?=.*sequence.*\.), tests, from the position that follows the final e of sentence, for the occurrence of the character sequence sequence after any number of intervening characters, as indicated by the first .*. It then tests for a later period character, as indicated by the escaped period \., again with any number of intervening characters, as indicated by the second .*.
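The same test can be sketched in Python using the four lines of Sentence.txt listed earlier. The lines are held in a list here rather than read from the file. In Python the . metacharacter does not match a newline unless re.DOTALL is set, so testing one line at a time keeps the lookahead within a single sentence for this data.

```python
import re

# The four lines of Sentence.txt as listed in the text.
sentences = [
    "Here is a sentence where one can look ahead to interesting character sequences.",
    "This sentence does not contain interesting characters.",
    "Here is a sequence of characters.",
    "Which sequence of characters is contained in this sentence?",
]

pattern = re.compile(r"sentence(?=.*sequence.*\.)")

results = [bool(pattern.search(s)) for s in sentences]
print(results)  # [True, False, False, False]

# The match itself is only the word "sentence"; the rest is lookahead.
print(pattern.search(sentences[0]).group())  # sentence
```

Only the first line matches. In the fourth line, sequence occurs before sentence rather than after it, so the lookahead finds nothing to satisfy it.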
