Beginning Regular Expressions 2005.pdf

Chapter 6

This chapter looks at how to do the following:

Use the ^ metacharacter, which matches the position at the beginning of a string or a line

Use the $ metacharacter, which matches the position at the end of a string or a line

Use the \< and \> metacharacters to match the beginning and end of a word, respectively

Use the \b metacharacter, which matches a word boundary (which can occur at the beginning of a word or at the end of a word)

String, Line, and Word Boundaries

Metacharacters that allow you to create patterns that match sequences of characters that occur at specific positions can be very useful.

For example, suppose that you wanted to find all lines that begin with the word The. With the techniques you have seen and used in earlier chapters, you can readily create a literal pattern to match the sequence of characters The, but with those techniques you haven’t been able to specify where the sequence of characters occurs in the text, nor whether it is a whole word or forms part of a longer word. The relevant pattern, written as The, would match sequences of characters such as There, Then, and so on at the beginning of a sentence in addition to the word The and would also match parts of personal or business names such as Theodore or Theatre.

Similarly, assuming that you used the pattern The in a case-insensitive mode, you would also (possibly as an undesired side effect) match sequences of characters such as the in the word lathe. At other times, you might want to find a sequence of characters only when they occur at the end of a word (again for example, the the in lathe).
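The over-matching described above can be reproduced in a few lines. The following is only a sketch in Python, which is an assumption of this example rather than a tool the chapter uses, and the word list is made up for illustration:

```python
import re

# Hypothetical word list chosen to mirror the examples in the text.
words = ["The", "There", "Then", "Theodore", "Theatre", "lathe"]

# Case-sensitive: the literal pattern matches every word that contains "The".
case_sensitive = [w for w in words if re.search(r"The", w)]

# Case-insensitive: the "the" inside "lathe" now matches as well.
case_insensitive = [w for w in words if re.search(r"The", w, re.IGNORECASE)]

print(case_sensitive)    # ['The', 'There', 'Then', 'Theodore', 'Theatre']
print(case_insensitive)  # all six words, including 'lathe'
```

Nothing in the literal pattern says where the characters must occur, which is the gap that positional metacharacters fill.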

The ^ and $ metacharacters, which are used to specify a position in relation to the beginning and end of a line or string, are discussed and demonstrated first.

The ^ Metacharacter

The ^ metacharacter causes matching to target characters that occur immediately after the beginning of a line or string.

So the pattern

The

when applied to the test text

The Thespian Theatre opens at 19:00.

would match the sequence of characters The in the words The, Thespian, and Theatre.


However, the same pattern preceded by the ^ metacharacter

^The

when applied to the same test text would match only the sequence of characters The in the word The because that sequence of characters occurs immediately after the start of the string.
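As a rough check of this behavior outside any particular tool, the sketch below uses Python's re module (an assumption of this example, not something the chapter relies on) to list the starting offsets of each match:

```python
import re

text = "The Thespian Theatre opens at 19:00."

# Without an anchor, every occurrence of the three characters matches.
unanchored = [m.start() for m in re.finditer(r"The", text)]

# With ^, only the occurrence at the very start of the string matches.
anchored = [m.start() for m in re.finditer(r"^The", text)]

print(unanchored)  # [0, 4, 13]
print(anchored)    # [0]
```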

The ^ metacharacter, when used outside a character class, does not have the negation meaning that it has when used as the first character inside a character class.
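The two meanings of ^ can be compared side by side. This is a small Python sketch; the test string "The lathe" is invented for the illustration:

```python
import re

text = "The lathe"

# Outside a character class, ^ anchors the match to the start of the string.
anchored = re.findall(r"^The", text)

# As the first character inside a character class, ^ negates the class:
# [^T]he matches "he" preceded by any one character other than uppercase T.
negated = re.findall(r"[^T]he", text)

print(anchored)  # ['The']
print(negated)   # ['the']
```

The second pattern skips the word The, because its first character is the excluded T, and finds the the inside lathe instead.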

Try It Out

Theatre Example

Use the very simple test text in the file Theatre.txt:

The Thespian Theatre opens at 19:00.

1. Open PowerGrep, and check the Regular Expression check box.

2. Enter the pattern The in the Search text box.

3. Enter C:\BRegExp\Ch06 in the Folder text box.

4. Enter Theatre.txt in the File Mask text box.

5. Click the Search button, and inspect the results in the Results area, as shown in Figure 6-1. Notice that the information in the Results area indicates three matches for the pattern The.

Figure 6-1

6. Edit the regular expression pattern so that it reads ^The.

7. Click the Search button, and inspect the results in the Results area, as shown in Figure 6-2. Notice that there is now only one match, in contrast to the three matches before you edited the regular expression pattern.


Figure 6-2
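If PowerGrep is not available, the same two searches can be approximated with a short script. The sketch below is in Python; the helper count_matches is a hypothetical name, and the test file is recreated in a temporary folder rather than read from C:\BRegExp\Ch06:

```python
import re
import tempfile
from pathlib import Path


def count_matches(path, pattern):
    """Return the number of non-overlapping matches of pattern in the file."""
    text = Path(path).read_text(encoding="utf-8")
    return len(re.findall(pattern, text))


# Recreate the one-line test file in a temporary folder (a stand-in for the book's folder).
folder = Path(tempfile.mkdtemp())
sample = folder / "Theatre.txt"
sample.write_text("The Thespian Theatre opens at 19:00.\n", encoding="utf-8")

print(count_matches(sample, r"The"))   # 3
print(count_matches(sample, r"^The"))  # 1
```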

How It Works

The regular expression engine starts at the position before the first character in the test file. The first metacharacter in the pattern, the ^ metacharacter, is matched against the regular expression engine’s current position. Because the regular expression engine is at the beginning of the file, the condition specified by the ^ metacharacter is satisfied, so the regular expression engine can proceed to attempt to match the other characters in the regular expression pattern. The next character in the pattern, the literal uppercase T, is matched against the first character in the test file, which is uppercase T. There is a match, so the regular expression engine attempts to match the next character in the pattern, lowercase h, against the second character in the test text, which is also lowercase h. The literal h in the pattern matches the literal h in the test text. Then the regular expression engine attempts to match the literal e in the pattern against the third character in the test text, lowercase e. There is a match. Because all components of the regular expression match, the entire regular expression matches.

If the regular expression engine attempts a match when the current position is anything other than the position before the first character of the test text, matching fails on that first metacharacter, ^. Therefore, the pattern as a whole cannot match. Matching fails except at the beginning of the test text.
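The walk-through above can be mimicked with a deliberately naive matcher. This Python sketch is not how PowerGrep or any real regular expression engine is implemented; the function match_anchored_literal is a hypothetical helper that handles only a ^ followed by literal characters:

```python
def match_anchored_literal(text, literal):
    """Try each start position; succeed only where ^ and the literal both hold."""
    for start in range(len(text) + 1):
        # Step 1: ^ is a test on the position itself, not on a character.
        if start != 0:
            continue  # ^ fails everywhere except before the first character
        # Step 2: compare the literal characters one at a time.
        for offset, expected in enumerate(literal):
            pos = start + offset
            if pos >= len(text) or text[pos] != expected:
                break  # a character failed, so this start position fails
        else:
            # Every character matched, so the whole pattern matches.
            return start, start + len(literal)
    return None


print(match_anchored_literal("The Thespian Theatre opens at 19:00.", "The"))  # (0, 3)
print(match_anchored_literal("A Theatre", "The"))  # None
```

The second call fails even though The occurs in the text, because the only position that satisfies ^ is followed by A rather than T.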

The ^ Metacharacter and Multiline Mode

In the preceding example, the test text is a single line, so you were able to examine the use of the ^ metacharacter without bothering about whether the ^ metacharacter would match the beginning of the test text or the beginning of each line, because the two concepts were the same. However, in several tools and languages, it is possible to modify the behavior of the ^ metacharacter so that it matches the position before the first character of each line or only at the beginning of the first line of the test file.

When using the Komodo Regular Expression Toolkit, for example, a search of the following test text

This

Then

will fail to find a match when the pattern is as follows:

^The


Figure 6-3 shows the failure to match.

Figure 6-3

However, if you check the Multi-Line Mode check box, the sequence of characters The on the second line is highlighted, and the message Match succeeded: 0 groups is displayed in the gray area below, as you can see in Figure 6-4.

Figure 6-4


When multiline mode is used, the position after a Unicode newline character is treated in the same way as the position that comes at the beginning of the test file. A Unicode newline character matches any of the characters or character combinations that can be used to express the notion of a newline.
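The effect of switching multiline mode on can be sketched in Python, where the re.MULTILINE flag plays the role of the check box. Python is an assumption of this example, and its re module treats only the line feed character \n as a line break, which is narrower than the Unicode newline handling described above:

```python
import re

# The two-line test text: "This" on the first line, "Then" on the second.
text = "This\nThen"

# Default mode: ^ matches only at the start of the whole string.
default_match = re.search(r"^The", text)

# Multiline mode: ^ also matches immediately after each "\n".
multiline_match = re.search(r"^The", text, re.MULTILINE)

print(default_match)           # None
print(multiline_match.span())  # (5, 8)
```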

Not all programming languages support multiline mode. How individual programming languages treat this issue is discussed and, where appropriate, demonstrated in later chapters that deal with individual programming languages.

Try It Out

The ^ Metacharacter and Multiline Mode

This exercise uses the test file TheatreMultiline.txt:

The Thespian Theatre opens at 19:00.

Then theatrical people enter the building.

They greatly enjoy the performance.

The interval is the time for liquid refreshment.

Notice that each line begins with the sequence of characters The.

Some tools, such as PowerGrep, are in multiline mode by default, as shown here.

1. Open PowerGrep, and check the Regular Expressions check box.

2. Enter the regular expression pattern ^The in the Search text box.

3. Enter C:\BRegExp\Ch06 in the Folder text box. Adjust this if you chose to put the download files in a different folder.

4. Enter TheatreMultiline.txt in the File Mask text box.

5. Click the Search button, and inspect the results in the Results area, as shown in Figure 6-5. Notice the character sequence The at the beginning of each line is highlighted as a match, indicating the default behavior of multiline mode.

Figure 6-5
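The result of this exercise can be approximated in Python by enabling multiline mode explicitly, because Python (unlike PowerGrep) does not use it by default. In this sketch, the text of TheatreMultiline.txt is embedded as a string, and the line-number calculation is an addition made for illustration:

```python
import re

text = (
    "The Thespian Theatre opens at 19:00.\n"
    "Then theatrical people enter the building.\n"
    "They greatly enjoy the performance.\n"
    "The interval is the time for liquid refreshment.\n"
)

# Multiline mode: report the 1-based line number of every match of ^The.
matched_lines = [
    text.count("\n", 0, m.start()) + 1
    for m in re.finditer(r"^The", text, re.MULTILINE)
]

# Without multiline mode, only the start of the whole text can match.
single_mode_count = len(re.findall(r"^The", text))

print(matched_lines)      # [1, 2, 3, 4]
print(single_mode_count)  # 1
```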
