Добавил:
Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:
Beginning Regular Expressions 2005.pdf
Скачиваний:
95
Добавлен:
17.08.2013
Размер:
25.42 Mб
Скачать

Chapter 13

Quantifiers

Support for quantifiers in findstr is limited. The * quantifier is supported with the standard meaning of zero or more occurrences. However, neither the ? quantifier nor the + quantifier is supported; neither is the {n,m} quantifier notation supported.

The test files Order1.txt and Order2.txt show how the * quantifier can be used.

The content of Order1.txt is shown here:

This is an order for Part No. ABC123.

Blah blah. As easy as ABC.

2004/08/20

The content of Order2.txt is here:

This is an order for Part No. ABC456.

Blah blah.

2003/07/18

For the purposes of this example, the part number is the focus of interest. In many regular expression implementations you would use ABC\d{3} or ABC[0-9]{3} to match exactly three digits, but findstr does not support that syntax.

Try It Out

The * Quantifier

1.Open a command window, and type the following command at the command prompt:

findstr /n “ABC [0-9]*” Order*.txt

2.Inspect the results returned, as shown in Figure 13-6. Notice that three lines contain a match. The second of the displayed lines is undesired because the occurrence of ABC with no following numeric digit is not a part number.

Figure 13-6

3.To match the desired number of numeric digits, exactly three, use the following pattern:

ABC[0-9][0-9][0-9]

310

Regular Expressions Using findstr

4.At the command line, enter the following command:

findstr /n “ABC[0-9][0-9][0-9]” Orders*.txt

Figure 13-7 shows the results.

Figure 13-7

How It Works

After Step 2, the two lines that contain part numbers consisting of the character sequence ABC followed by three numeric digits are matched, which is what you want. However, the second line in Orders1.txt is also matched, because the pattern [0-9]* matches zero or more occurrences of the character class that matches numeric digits. Because ABC in As easy as ABC. has zero occurrences of a numeric digit, the pattern ABC[0-9]* is matched, because the character sequence ABC is present together with zero occurrences of a numeric digit.

Back references, lookahead, and lookbehind are not supported in the findstr utility.

Character Classes

As you saw in an earlier example in this chapter, the character class [0-9] is supported in findstr. In fact, the character class [0-9], or one of the alternative ways of defining a character class, [0123456789], is needed because findstr does not support the \d metacharacter.

The following text, contained in the file PartNums.txt, is the test file:

ABC123

DEF890

GHI234

HKO838

RUV991

ILR246

UVW991

ADF274

DRX119

In findstr ranges are supported, as are negated character classes.

311

Chapter 13

Try It Out

Character Classes

1.Open a command window, and navigate to the directory containing the file PartNums.txt.

2.Type the following at the command line:

findstr /n “A[A-Z][A-Z][0-9][0-9][0-9]” PartNums.txt

3.Inspect the results, as shown in Figure 13-8. The lines containing part numbers that begin with uppercase A, have two uppercase letters following, and have three numeric digits are displayed.

Figure 13-8

Because of the way that findstr works, you could have used a simpler pattern, A[A-Z] [A-Z][0-9], given the sample data. If there were part numbers such as ABC1 in the test text, the preceding pattern would match lines containing part numbers like that, which may not be what you want.

4.Type the following command at the command line:

findstr /n “A[A-Z][A-Z][0-9]” PartNums.txt

5.Inspect the results. (Notice that the same lines are matched.)

6.If you want to match part numbers that begin with an uppercase A but that do not have an uppercase B as the second character in the part number, you can use the negated character class [^B] to achieve that.

At the command line, type the following command:

findstr /n “A[^B][A-Z][0-9]” PartNums.txt

7.Inspect the results (notice that Line 1 no longer matches), as shown in Figure 13-9.

Figure 13-9

How It Works

In Step 2, the pattern A[A-Z][A-Z][0-9][0-9][0-9] is used. The A matches uppercase A literally. Because only Line 1 and Line 15 contain a part number beginning with A, only those lines are possible matches for the rest of the regular expression. The character class [A-Z] matches any alphabetic character,

312

Regular Expressions Using findstr

matching B on Line 1 and D on Line 15. The second occurrence of the character class [A-Z] in the regular expression matches C on Line 1 and F on Line 15. The three character classes [0-9][0-9][0-9] match three successive numeric digits, 123 on Line 1 and 274 on Line 15. So there are matches on lines 1 and 15.

In Step 6, the pattern A[^B][A-Z][0-9] is used. The initial A matches on lines 1 and 15 as before. The character class [^B] matches any character except uppercase B. So there is no match for A[^B] on Line 1. However, on Line 15, D is a match for [^B], so matching continues on Line 15. The [A-B] pattern matches the F on Line 15, and [0-9][0-9][0-9] matches 274 on Line 15. So the only match is on Line 15.

There is a risk in having a character class such as [^B] in a pattern, because that is almost equivalent to the dot character. So if a malformed part number A$C123 were in the test file, it would match the pattern [A-Z][^B][A-Z][0-9][0-9][0-9]. If the intent was that any uppercase character except B was desired, a more specific character class would be [AC-Z]. So the regular expression would be

A[AC-Z][A-Z][0-9][0-9][0-9]

with [AC-Z] having the same meaning as [ACDEFGHIJKLMNOPQRSTUVWXYZ].

Word-Boundar y Positions

The findstr utility supports separate metacharacters that match the beginning-of-word position and the end-of-word position. The \< metacharacter matches the beginning-of-word position, and the \> metacharacter matches the end-of-word position.

A test file, Word.txt, has the following content:

Swords are sharp, typically.

Words are powerful things. They can wound.

Churchill is a byword for wartime persistence.

Do you have a favorite word?

His surname is Answord.

Wordsworth was a famous English poet.

Word by word is, typically, not a good method of translation.

Notice that the character sequence word occurs at the beginning or end of a sequence of alphabetic characters or embedded inside a longer character sequence. Notice, too, that sometimes an uppercase character is part of word or Word, so you must take care in how you use case-sensitive or case-insensitive search.

Try It Out

Beginningand End-of-Word Positions

1.Open a command window, and navigate to the directory containing the Word.txt test file.

2.At the command prompt, enter the following command:

findstr /n “Word” Word.txt

313

Chapter 13

3.Inspect the results, as shown in Figure 13-10. Notice that only three of the seven lines containing text are displayed. This is so because the default behavior of findstr is case-sensitive matching.

Figure 13-10

4.To ensure that all occurrences of the character sequence word are displayed, you can use the /i command-line switch.

At the command line, enter the following command:

findstr /n /i “Word” Word.txt

5.Inspect the results. Now all seven lines containing text are displayed. So you can be confident that all occurrences of the character sequence word are now displayed.

6.Next, let’s look at the effect of the beginning-of-word position metacharacter, \<.

At the command prompt, enter the following command:

findstr /n /i “\<Word” Word.txt

7.Inspect the results, as shown in Figure 13-11. As you can see, only four of the seven lines that contain text are displayed. Each of the lines contains the character sequence word or Word (remember the matching is case insensitive) with that character sequence at the beginning of an alphabetic character sequence (in effect, at the beginning of what you would typically call a “word”).

Figure 13-11

8.You can add the end-of-word position metacharacter, \>, to the regular expression to make the matching more specific, matching only when the character sequence word or Word is preceded by a beginning-of-word position and followed by an end-of-word position.

At the command prompt, enter the following command:

findstr /n /i “\<Word\>” Word.text

9.Inspect the results, as shown in Figure 13-12. Now only two lines are displayed. On each line the character sequence word is actually just that — a word. Strictly speaking, the beginning-of-word position and end-of-word position metacharacters mark the beginning and end of an alphabetic sequence, respectively. For many practical purposes, they signify the beginning and end of a word.

314

Regular Expressions Using findstr

Figure 13-12

Beginningand End-of-Line Positions

The findstr utility offers two quite distinct ways to specify that matching is to take place at the beginning or end of a line. First, there are the /b and /e switches, which specify matching at the beginning and end of a line, respectively. Second, there are the ^ and $ metacharacters.

The content of the test file, Low.txt, is shown here:

Low is the opposite of high.

A Ferrari isn’t usually thought of as slow.

Slow, slow, quick, quick, slow

Slow, slow, quick, quick, slow.

Allow me to to pass please.

Lowering sky over a blackened sea.

Try It Out Beginningand End-of-Line Positions

1.Open a command window, and navigate to the directory containing the file Low.txt.

2.At the command prompt, enter the following command:

findstr /n /i “Low” Low.txt

3.Inspect the results. All six lines that contain text are displayed because the character sequence low, matched case insensitively (notice the /i switch), is present on all six lines.

4.Next, test the /b switch, which limits matching to the beginning of a line.

At the command line, enter the following command:

findstr /n /i /b “Low” Low.txt

5.Inspect the results, as shown in Figure 13-13. Now only two lines are displayed, each of which has the character sequence Low as its first three characters.

Figure 13-13

315