Beginning Regular Expressions 2005.pdf

PowerGREP

Figure 14-19

How It Works

First, let’s look at the pattern Star(?=\.). The pattern Star is matched literally. Because PowerGREP matching is case insensitive by default, the character sequences star and Star both match. In fact, STAR and sTar would match too, although neither is present in the test text. However, the lookahead constrains the match. The lookahead (?=\.) means that after Star is matched literally, the match fails unless Star (in whatever case) is followed immediately by a period character, which is written as the escaped metacharacter \. in the pattern.

The pattern (?<=with )Star includes a lookbehind, which specifies that the character sequence Star (in whatever case) matches only if it is preceded by the character sequence with followed by a space character.
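The two lookaround patterns can be tried outside PowerGREP as well. The following is a minimal sketch using Python's re module as a stand-in; the sample sentence is hypothetical (the book's test text is not reproduced here), and the IGNORECASE flag is an assumption meant to mirror PowerGREP's case-insensitive default.

```python
import re

# Hypothetical sample text, not the test file used in the chapter.
text = "I saw a star. She is a Star. He works with Star Training."

# Lookahead: "star" in any case, only when a literal period follows.
lookahead = re.compile(r"Star(?=\.)", re.IGNORECASE)

# Lookbehind: "star" in any case, only when "with " comes just before it.
lookbehind = re.compile(r"(?<=with )Star", re.IGNORECASE)

ahead_hits = [m.group() for m in lookahead.finditer(text)]
behind_hits = [m.group() for m in lookbehind.finditer(text)]

print(ahead_hits)   # ['star', 'Star']
print(behind_hits)  # ['Star']
```

Note that the period and the word with are checked but not consumed: each match contains only the four characters of Star.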

Longer Examples

This section looks at some longer examples that apply some of the regular expression functionality found in PowerGREP. One of the most useful aspects of PowerGREP is that it finds matches across multiple text files. For reasons of space, the examples use only two files, each of which is short.

Finding HTML Horizontal Rule Elements

This example aims to find all occurrences of the HTML horizontal rule element, <hr>, across multiple documents.

A first attempt at a problem definition would be as follows:

Match all HTML/XHTML horizontal rule elements.

Clearly, you need to understand the permitted structure of an hr element to refine this further.

The form may be as simple as:

<hr>

Chapter 14

which can also be written as uppercase. The latter is often found in HTML. Or it can have attributes in HTML style, without enclosing quotation marks:

<hr width=50% color=#990066 size=4 />

Or it can have paired quotation marks around attribute values, as in XHTML:

<hr width="50%" color="#990066" size="4" />

Or it can have paired apostrophes:

<hr width='50%' color='#990066' size='4' />

Notice, too, that in the XHTML form there is a forward slash before the right-angled bracket at the end of the element.

A more detailed attempt at a problem definition would be the following:

Match a < character followed by the character sequence hr (either case), followed by optional whitespace characters, followed by zero or more characters, followed by optional whitespace characters, followed by an optional forward slash, followed by a > character.

A pattern corresponding to the preceding problem definition is shown here:

<hr *.* */?>

The simple sample documents are shown here. First, HorizRule1.html:

<html>
<head>
<title>Horizontal Rule 1</title>
</head>
<body>
<p>This file contains a horizontal rule with no attributes.</p>
<hr />
</body>
</html>

Then HorizRule2.html:

<html>
<head>
<title>Horizontal Rule 1</title>
</head>
<body>
<p>This file contains a horizontal rule with three attributes.</p>
<hr width="50%" color="#990066" size="4" />
</body>
</html>

Try It Out

Horizontal Rules

1. Open PowerGREP, and ensure that the Regular Expression check box is checked.

2. In the Search text area, type the pattern <hr *.* */?>.

3. Ensure that the Folder text box contains C:\BRegExp\Ch14 or adapt the path if you downloaded code to a different directory.

4. In the File Mask text box, type Horiz*.html; click the Search button; and inspect the results, as shown in Figure 14-20.

Figure 14-20

How It Works

The pattern <hr *.* */?> matches a < character followed by the character sequence hr (either case), followed by optional space characters, followed by zero or more characters, followed by zero or more space characters, followed by an optional forward-slash character, followed by a > character.
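As a rough cross-check of that description, the same pattern can be run against the element forms listed earlier. This is only a sketch in Python's re module, not PowerGREP itself; the IGNORECASE flag is an assumption that mimics PowerGREP's case-insensitive default, and the last sample line is a hypothetical non-match added for contrast.

```python
import re

# The chapter's pattern, compiled case-insensitively (assumed equivalent
# to PowerGREP's default behavior).
HR = re.compile(r"<hr *.* */?>", re.IGNORECASE)

samples = [
    "<hr>",
    "<HR>",
    "<hr />",
    "<hr width=50% color=#990066 size=4 />",
    '<hr width="50%" color="#990066" size="4" />',
    "<hr width='50%' color='#990066' size='4' />",
    "<p>No rule here.</p>",  # hypothetical line without an hr element
]

# Map each sample to True/False depending on whether the pattern matches.
results = {s: bool(HR.search(s)) for s in samples}

for sample, found in results.items():
    print(f"{found!s:5}  {sample}")
```

Every hr form is reported as a match, and the paragraph line is not.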

If you simply want to find all correctly structured hr elements, this pattern should be close to 100 percent sensitive. If the element is spread over several lines:

<hr
width="50%"
color="#990066" size="4" />

the pattern could be usefully modified to the following:

<hr\s*.*\s*/?>

The \s metacharacter ensures that tab characters or newline characters are also matched.
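Whether the modified pattern spans several lines also depends on how the engine treats the dot. The sketch below uses Python's re module, where the dot does not match a newline unless the DOTALL flag is set; PowerGREP's own option for this is not covered here, so treat the result as Python-specific. The third pattern, <hr\b[^>]*>, is an alternative introduced purely for illustration and is not from the chapter.

```python
import re

# An hr element spread over three lines.
element = '<hr\nwidth="50%"\ncolor="#990066" size="4" />'

# The chapter's modified pattern with Python's default dot behavior.
plain = re.compile(r"<hr\s*.*\s*/?>", re.IGNORECASE)

# The same pattern, with the dot allowed to match newline characters.
dotall = re.compile(r"<hr\s*.*\s*/?>", re.IGNORECASE | re.DOTALL)

# Illustrative alternative (an assumption, not the book's pattern):
# a negated character class matches newlines without any flag.
negated = re.compile(r"<hr\b[^>]*>", re.IGNORECASE)

plain_hit = plain.search(element)
dotall_hit = dotall.search(element)
negated_hit = negated.search(element)

print(plain_hit)                        # None
print(dotall_hit.group() == element)    # True
print(negated_hit.group() == element)   # True
```

In Python, then, \s alone is not enough when attributes sit on two or more separate lines, because .* stops at the first newline it meets.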

The pattern will also find some incorrectly formed character sequences that are likely intended to be hr elements. For example:

<hr width=="50%" />

has two consecutive = signs, which is not allowed. Using the .* pattern, you could match all sorts of illegal character sequences. Although this lowers the specificity of the pattern, it might be useful because it would ensure that all hr elements were matched, even if they contained slight syntax errors.

If the files of interest contain HTML or XHTML markup, this type of loss of specificity is unlikely to be a significant problem.
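To make the trade-off concrete, here is a small sketch (Python's re module standing in for PowerGREP, with IGNORECASE assumed) showing what the permissive .* lets through. The tag name href-list and the line with bold text are hypothetical inputs chosen for illustration.

```python
import re

HR = re.compile(r"<hr *.* */?>", re.IGNORECASE)

malformed = '<hr width=="50%" />'   # doubled = sign, still matched
not_hr = "<href-list>"              # hypothetical tag that merely starts with "hr"
same_line = "<hr /> <b>bold</b>"    # hypothetical line with markup after the rule

malformed_hit = HR.search(malformed)
not_hr_hit = HR.search(not_hr)
same_line_hit = HR.search(same_line)

print(malformed_hit.group())   # <hr width=="50%" />
print(not_hr_hit.group())      # <href-list>
print(same_line_hit.group())   # <hr /> <b>bold</b>
```

The last result shows the greedy .* running on to the final > on the line, so the reported match can be longer than the hr element itself. That is harmless when the goal is only to locate lines containing horizontal rules, but it matters if the matched text is to be replaced.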

Matching Time Example

This example looks at how you can match data that makes up a time of day. The first attempt at a problem definition can be expressed as follows:

Match any time of day, whether expressed as 12-hour clock notation or 24-hour clock notation.

To refine what’s needed, you must fully understand each of the notations and how it is written.

The 12-hour notation might have values like this:

9:31 am

or:

09:31am

or:

09:31 pm

or:

09:31pm

An optional first digit can be a 0 or 1. When there is a 0, the next digit can be 0 to 9 inclusive, but when the first digit is a 1, the next digit can only be a 0, 1, or 2 in 12-hour notation.

The following pattern would match hours up to 09:

[0]?[0-9]

Hours from 10 to 12 would be matched by the following pattern:

1[0-2]

So for the part of the pattern before the colon character, you can use the following pattern:

([0]?[0-9]|1[0-2])

If you test that out on illegal “times” such as 18:88pm, it will match, but that problem goes away once the rest of the pattern is in place: a word boundary before the hour, a colon after it, and a constrained minutes component.
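The behavior on 18:88pm can be checked step by step. This is a sketch in Python's re module rather than PowerGREP; the third pattern adds the \b word boundary and the minutes component that appear in the chapter's complete 12-hour pattern.

```python
import re

bad = "18:88pm"

# Hours component on its own: matches the leading "1".
hours_only = re.compile(r"([0]?[0-9]|1[0-2])")

# Hours plus a colon: still finds "8:" inside the string.
hours_colon = re.compile(r"([0]?[0-9]|1[0-2]):")

# Word boundary, hours, colon, and minutes: no match at all.
anchored = re.compile(r"\b([0]?[0-9]|1[0-2]):[0-5][0-9]")

first = hours_only.search(bad).group()
second = hours_colon.search(bad).group()
third = anchored.search(bad)

print(repr(first))   # '1'
print(repr(second))  # '8:'
print(third)         # None
```

In this engine the colon by itself does not reject the string; the word boundary and the [0-5][0-9] minutes do.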

Matching the remainder of the 12-hour time is straightforward:

:[0-5][0-9] ?[ap]m

Putting those parts together, you have the following pattern to match times in 12-hour time notation:

\b([0]?[0-9]|1[0-2]):[0-5][0-9] ?[ap]m
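The assembled pattern can be exercised on the times from the chapter's Time1.txt test file. The following sketch uses Python's re module as a stand-in for PowerGREP, with IGNORECASE assumed to mirror its default; the extra 00:15 am input is hypothetical and is there to show one remaining looseness.

```python
import re

TIME_12H = re.compile(r"\b([0]?[0-9]|1[0-2]):[0-5][0-9] ?[ap]m", re.IGNORECASE)

# Contents of Time1.txt as listed in the chapter.
time1 = ["08:22 pm", "08:37 am", "19:88 am", "12:00 am",
         "11:39pm", "7:28 am", "8:19 am"]

matched = [t for t in time1 if TIME_12H.search(t)]
rejected = [t for t in time1 if not TIME_12H.search(t)]

print(matched)
print(rejected)  # ['19:88 am']

# Caveat: [0]?[0-9] also admits an hour of 0 or 00.
zero_hour = TIME_12H.search("00:15 am")
print(zero_hour is not None)  # True
```

If an hour of zero should be rejected, the hours component would need tightening, for example to (0?[1-9]|1[0-2]); that variant is a suggestion, not the chapter's pattern.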

Twenty-four-hour time can be expressed using the following pattern:

([01][0-9]|2[0-4]):[0-5][0-9]
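A quick check of this pattern, again sketched in Python's re module rather than PowerGREP, uses the values from the chapter's Time2.txt test file. One thing it brings out is that 2[0-4] also admits hours of 24, so a string such as 24:59 is accepted; the stricter STRICT_24H variant with 2[0-3] is an assumption added here for comparison and is not the chapter's pattern.

```python
import re

# The chapter's 24-hour pattern.
TIME_24H = re.compile(r"([01][0-9]|2[0-4]):[0-5][0-9]")

# Stricter variant (an assumption for illustration): hours 00 to 23 only.
STRICT_24H = re.compile(r"([01][0-9]|2[0-3]):[0-5][0-9]")

# Contents of Time2.txt as listed in the chapter.
time2 = ["06:31", "19:15", "18:12", "23:59", "00:03", "19:54", "03:00", "10:49"]

all_match = all(TIME_24H.fullmatch(t) for t in time2)
all_match_strict = all(STRICT_24H.fullmatch(t) for t in time2)

loose_2459 = TIME_24H.fullmatch("24:59")
strict_2459 = STRICT_24H.fullmatch("24:59")

print(all_match, all_match_strict)   # True True
print(loose_2459 is not None)        # True
print(strict_2459)                   # None
```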

Putting those two patterns together using alternation, you have the following pattern:

(\b([0]?[0-9]|1[0-2]):[0-5][0-9] ?[ap]m|([01][0-9]|2[0-4]):[0-5][0-9])

Test file Time1.txt contains a range of 12-hour–format times:

08:22 pm
08:37 am
19:88 am
12:00 am
11:39pm
7:28 am
8:19 am

Test file Time2.txt contains a range of 24-hour–format times:

06:31
19:15
18:12
23:59
00:03
19:54
03:00
10:49

Try It Out Matching Times

1. Open PowerGREP, and ensure that the Regular Expression check box is checked. First, you will match 12-hour times in Time1.txt.

2. In the Search text area, enter the pattern \b([0]?[0-9]|1[0-2]):[0-5][0-9] ?[ap]m.

3. In the Folder text box, type C:\BRegExp\Ch14 or adapt the path according to your folder structure.

4. In the File Mask text box, type Time1.txt; click the Search button; and inspect the results, as shown in Figure 14-21. Notice that all the legal times in the test file are matched, but the illegal “time” 19:88 am is not matched.

Next, you will attempt to match the 24-hour–format times in Time2.txt.

Figure 14-21

5. Edit the content of the Search text area to ([01][0-9]|2[0-4]):[0-5][0-9].

6. Edit the file mask to Time2.txt; click the Search button; and inspect the results, as shown in Figure 14-22. Notice that all 24-hour–format times are matched.

Figure 14-22

Finally, you will attempt to match times in the formats in both Time1.txt and Time2.txt.

7. Edit the content of the Search text area to (\b([0]?[0-9]|1[0-2]):[0-5][0-9] ?[ap]m|([01][0-9]|2[0-4]):[0-5][0-9]).

8. Edit the file mask to Time*.txt; click the Search button; and inspect the results, as shown in Figure 14-23. Notice that all valid 12-hour–format and 24-hour–format times in the test files are matched.
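The multi-file search in these steps can be approximated in a few lines. This sketch uses Python's re module in place of PowerGREP and keeps the two test files in an in-memory dictionary instead of reading them from C:\BRegExp\Ch14; the dictionary and the IGNORECASE flag are assumptions made to keep the example self-contained.

```python
import re

# The chapter's combined pattern: 12-hour alternative first, then 24-hour.
COMBINED = re.compile(
    r"(\b([0]?[0-9]|1[0-2]):[0-5][0-9] ?[ap]m|([01][0-9]|2[0-4]):[0-5][0-9])",
    re.IGNORECASE,
)

# In-memory stand-ins for the two test files.
files = {
    "Time1.txt": "08:22 pm\n08:37 am\n19:88 am\n12:00 am\n11:39pm\n7:28 am\n8:19 am\n",
    "Time2.txt": "06:31\n19:15\n18:12\n23:59\n00:03\n19:54\n03:00\n10:49\n",
}

# Collect the full text of every match, file by file.
hits = {name: [m.group(0) for m in COMBINED.finditer(body)]
        for name, body in files.items()}

for name, found in hits.items():
    print(name, found)
```

Because the 12-hour alternative is listed first, a value such as 08:22 pm is reported with its am/pm suffix rather than being cut short to 08:22 by the 24-hour alternative.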
