Beginning Regular Expressions 2005.pdf

Chapter 3

match the pattern [0-9]+, which means “Match one or more numeric characters.” Four numeric digits follow the uppercase C of ABC, so there is a match (of four numeric digits, which meets the criterion “one or more numeric digits”). Because all components of the pattern match, the whole pattern matches.

Before moving on to look at the curly-brace quantifier syntax, here’s a brief review of the quantifiers already discussed, as listed in the following table:

Quantifier    Definition
?             0 or 1 occurrences
*             0 or more occurrences
+             1 or more occurrences

These quantifiers can often be useful, but there are times when you will want to express ideas such as “Match something that occurs at least twice but can occur an unlimited number of times” or “Match something that can occur at least three times but no more than six times.”

You also saw earlier that you can express a repeating character by simply repeating the character in a regular expression pattern.

The Curly-Brace Syntax

If you want to specify large numbers of occurrences, you can use a curly-brace syntax to specify an exact number of occurrences.

The {n} Syntax

Suppose that you want to match part numbers with sequences of characters that have exactly three numeric digits. You can write the pattern as:

ABC[0-9][0-9][0-9]

by simply repeating the character class for a numeric digit. Alternatively, you can use the curly-brace syntax and write:

ABC[0-9]{3}

to achieve the same result.
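The equivalence of the two patterns can be checked with a short sketch. It uses Python's re module rather than OpenOffice.org Writer, which is an assumption made for illustration, and the sample part numbers are hypothetical apart from ABC and ABC8899.

```python
import re

# Hypothetical sample part numbers; only ABC and ABC8899 appear in the text.
parts = ["ABC", "ABC12", "ABC123", "ABC8899"]

repeated = re.compile(r"ABC[0-9][0-9][0-9]")  # character class written three times
braced = re.compile(r"ABC[0-9]{3}")           # same idea with the {n} quantifier


def first_match(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None


repeated_results = [first_match(repeated, p) for p in parts]
braced_results = [first_match(braced, p) for p in parts]

print(repeated_results)
print(braced_results)
```

Both lists should be identical. Note that ABC8899 still produces a match on its first six characters, because the pattern asks only for a sequence of characters, not for a whole part number.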

Most regular expression engines support a syntax that can express ideas like that. The syntax uses curly braces to specify minimum and maximum numbers of occurrences.

Simple Regular Expressions

The {n,m} Syntax

The * quantifier that was described a little earlier in this chapter effectively means “Match a minimum of zero occurrences, with no upper limit on the number of occurrences.” Similarly, the + quantifier means “Match a minimum of one occurrence, with no upper limit on the number of occurrences.”

Using curly braces and numbers inside them allows the developer to create occurrence quantifiers that cannot be specified when using the ?, *, or + quantifiers.
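As a rough check of how * and + relate to the curly-brace form, the following sketch compares each with an open-ended curly-brace quantifier. It assumes Python's re module and a hypothetical sample string; other engines may differ in the details.

```python
import re

# Hypothetical sample text containing three part numbers.
text = "ABC ABC1 ABC8899"

star = re.findall(r"ABC[0-9]*", text)            # zero or more digits
star_braces = re.findall(r"ABC[0-9]{0,}", text)  # minimum 0, no maximum

plus = re.findall(r"ABC[0-9]+", text)            # one or more digits
plus_braces = re.findall(r"ABC[0-9]{1,}", text)  # minimum 1, no maximum

print(star, star_braces)
print(plus, plus_braces)
```

In each pair the two patterns return the same matches, which is why the curly-brace syntax can be seen as a generalization of the quantifiers already covered.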

The following subsections look at three variants that use the curly-brace syntax. First, let’s look at the syntax that specifies “Match from zero up to [a specified number of] occurrences.”

{0,m}

The {0,m} syntax allows you to specify that a minimum of zero occurrences can be matched (specified by the first number after the opening curly brace) and that a maximum of m occurrences can be matched (specified by the second number, which is separated from the minimum-occurrence specifier by a comma and which precedes the closing curly brace).

To match a minimum of zero occurrences and a maximum of one occurrence, you would use the pattern:

{0,1}

which has the same meaning as the ? quantifier.

To specify matching of a minimum of zero occurrences and a maximum of three occurrences, you would use the pattern:

{0,3}

which you couldn’t express using the ?, *, or + quantifiers.
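A minimal sketch of both quantifiers, assuming Python's re module and a hypothetical sample string, with the [0-9] character class supplying the thing to be repeated:

```python
import re

# Hypothetical sample text containing three part numbers.
text = "ABC ABC1 ABC8899"

question = re.findall(r"ABC[0-9]?", text)         # ? quantifier
zero_to_one = re.findall(r"ABC[0-9]{0,1}", text)  # same meaning as ?

zero_to_three = re.findall(r"ABC[0-9]{0,3}", text)  # no ?, *, + equivalent

print(question, zero_to_one)
print(zero_to_three)
```

The first two lists are the same, while {0,3} stops after the third digit of ABC8899, something that neither ?, *, nor + can express on its own.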

Suppose that you want to match the sequence of characters ABC followed by a minimum of zero and a maximum of two numeric digits.

You can semiformally express that as the following problem definition:

Match an uppercase A. If there is a match, attempt to match an uppercase B. If there is a match, attempt to match an uppercase C. If all three uppercase characters match, attempt to match a minimum of zero and a maximum of two numeric digits.

The following pattern does what you need:

ABC[0-9]{0,2}

The ABC simply matches a sequence of the corresponding literal characters. The [0-9] indicates that a numeric digit is to be matched, and the {0,2} is a quantifier that indicates that a minimum of zero occurrences and a maximum of two occurrences of the preceding chunk (which is [0-9], representing a numeric digit) are to be matched.


Try It Out

Match Zero to Two Occurrences

1. Open OpenOffice.org Writer, and open the sample file Parts.txt.

2. Use Ctrl+F to open the Find and Replace dialog box.

3. Check the Regular Expressions and Match Case check boxes.

4. Enter the regular expression pattern ABC[0-9]{0,2} in the Search For text box; click the Find All button; and inspect the matches that are displayed in highlighted text, as shown in Figure 3-19.

Figure 3-19

Notice that on some lines, only parts of a part number are matched. If you are puzzled as to why that is, refer back to the problem definition. You are to match a specified sequence of characters. You haven’t specified that you want to match a part number, simply a sequence of characters.


How It Works

How does it work with the match for the part number ABC? When the regular expression engine is at the position immediately before the uppercase A of the part number ABC, it attempts to match an uppercase A. That matches. Next, an attempt is made to match an uppercase B. That too matches. Next, an attempt is made to match an uppercase C. That too matches. At that stage, the first three characters in the regular expression pattern have been matched. Finally, an attempt is made to match the pattern [0-9]{0,2}, which means “Match a minimum of zero and a maximum of two numeric characters.” Zero numeric digits follow the uppercase C in ABC. Because there are exactly zero numeric digits after the uppercase C of ABC, there is a match (zero numeric digits matches the criterion “a minimum of zero numeric digits” specified by the minimum-occurrence specifier of the {0,2} quantifier). Because the final component of the pattern matches, the whole pattern matches.

What happens when matching is attempted on the line that contains the part number ABC8899? Why do the first five characters of the part number ABC8899 match? When the regular expression engine is at the position immediately before the A of ABC8899, it attempts to match the next character in the part number with an uppercase A and finds a match. Next, an attempt is made to match an uppercase B. That too matches. Then an attempt is made to match an uppercase C, which also matches. At that stage, the first three characters in the regular expression pattern have been matched. Finally, an attempt is made to match the pattern [0-9]{0,2}, which means “Match a minimum of zero and a maximum of two numeric characters.” Four numeric digits follow the uppercase C, but at most two of them can be consumed by the quantifier. There is therefore a match of two numeric digits, which meets the criterion “a maximum of two numeric digits,” and the final two numeric digits of ABC8899 are not part of the match, so they are not highlighted. Because all components of the pattern match, the whole pattern matches.
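The two cases walked through above can be reproduced with a short sketch. It assumes Python's re module in place of OpenOffice.org Writer, and it records both the matched text and whatever is left over after the match.

```python
import re

pattern = re.compile(r"ABC[0-9]{0,2}")

# The two part numbers discussed in the text.
results = {}
for part in ["ABC", "ABC8899"]:
    m = pattern.search(part)
    matched = m.group(0)       # the text that would be highlighted
    leftover = part[m.end():]  # characters after the match, not highlighted
    results[part] = (matched, leftover)
    print(f"{part!r}: matched {matched!r}, left over {leftover!r}")
```

For ABC the quantifier is satisfied by zero digits, so the match is ABC with nothing left over. For ABC8899 the quantifier takes two digits, so the match is ABC88 and the trailing 99 is left over.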

{n,m}

The minimum-occurrence specifier in the curly-brace syntax doesn’t have to be 0. It can be any number you like, provided it is not larger than the maximum-occurrence specifier.
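What happens when the minimum is larger than the maximum depends on the engine. As one data point, Python's re module (an assumption here, since the chapter uses OpenOffice.org Writer) refuses to compile such a pattern. The helper name compiles is hypothetical.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern, else False."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


valid = compiles(r"ABC[0-9]{1,3}")    # minimum 1, maximum 3
invalid = compiles(r"ABC[0-9]{3,1}")  # minimum larger than maximum

print(valid, invalid)
```

Other tools may report the problem differently or silently fail to match, so it is worth testing the behavior in the tool you actually use.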

Let’s look for one to three occurrences of a numeric digit. You can specify this in a problem definition as follows:

Match an uppercase A. If there is a match, attempt to match an uppercase B. If there is a match, attempt to match an uppercase C. If all three uppercase characters match, attempt to match a minimum of one and a maximum of three numeric digits.

So if you wanted to match one to three occurrences of a numeric digit in Parts.txt, you would use the following pattern:

ABC[0-9]{1,3}

Figure 3-20 shows the matches in OpenOffice.org Writer. Notice that the part number ABC does not match, because it has zero numeric digits, and you are looking for matches that have one through three numeric digits. Notice, too, that only the first three numeric digits of ABC8899 form part of the match.

The How It Works explanation in the preceding section for the {0,m} syntax should be sufficient to help you understand what is happening in this example.
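The same behavior can be sketched outside Writer, here assuming Python's re module. The part number ABC12 is a hypothetical addition to the two part numbers mentioned in the text.

```python
import re

pattern = re.compile(r"ABC[0-9]{1,3}")

# ABC and ABC8899 come from the text; ABC12 is a hypothetical extra.
parts = ["ABC", "ABC12", "ABC8899"]

matches = {}
for part in parts:
    m = pattern.search(part)
    matches[part] = m.group(0) if m else None  # None means no match

print(matches)
```

ABC yields no match because at least one digit is required, ABC12 matches in full, and ABC8899 matches only as far as its third digit.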


Figure 3-20

{n,}

Sometimes, you will want there to be an unlimited number of occurrences. You can specify an unlimited maximum number of occurrences by omitting the maximum-occurrence specifier inside the curly braces.

To specify at least two occurrences and an unlimited maximum, you could use the following problem definition:

Match an uppercase A. If there is a match, attempt to match an uppercase B. If there is a match, attempt to match an uppercase C. If all three uppercase characters match, attempt to match a minimum of two numeric digits, with no upper limit on the number of numeric digits.

You can express that using the following pattern:

ABC[0-9]{2,}

Figure 3-21 shows the appearance in OpenOffice.org Writer. Notice that now all four numeric digits in ABC8899 form part of the match, because the maximum occurrences that can form part of a match are unlimited.
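A comparable check, assuming Python's re module rather than Writer, shows the open-ended maximum at work. The part number ABC1 is a hypothetical addition used to show the effect of the minimum.

```python
import re

pattern = re.compile(r"ABC[0-9]{2,}")

# ABC and ABC8899 come from the text; ABC1 is a hypothetical extra.
parts = ["ABC", "ABC1", "ABC8899"]

open_ended = {}
for part in parts:
    m = pattern.search(part)
    open_ended[part] = m.group(0) if m else None  # None means no match

print(open_ended)
```

ABC and ABC1 fail because fewer than two digits follow the C, while ABC8899 matches in full because nothing limits how many digits the quantifier may consume.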
