Beginning Regular Expressions 2005.pdf

Chapter 3

Other Cardinality Operators

Testing for matches only for optional characters can be very useful, as you saw in the colors example, but it would be pretty limiting if that were the only quantifier available to a developer. Most regular expression implementations provide two other cardinality operators (also called quantifiers): the * operator and the + operator, which are described in the following sections.

The * Quantifier

The * operator refers to zero or more occurrences of the pattern to which it is related. In other words, a character or group of characters is optional but may occur more than once. Zero occurrences of the chunk that precedes the * quantifier should match. A single occurrence of that chunk should also match. So should two occurrences, three occurrences, and ten occurrences. In principle, an unlimited number of occurrences will also match.
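As a quick check of that description outside OpenOffice.org Writer, here is a minimal sketch using Python's re module. Python is my choice of tool here, not the chapter's, and the helper name matches_whole is hypothetical. It tests the chunk [0-9] with the * quantifier against strings of zero, one, two, three, and ten digits.

```python
import re

# The chunk [0-9] followed by the * quantifier: zero or more digits.
ZERO_OR_MORE_DIGITS = re.compile(r"[0-9]*")


def matches_whole(text: str) -> bool:
    """Return True if the entire string consists of zero or more digits."""
    return ZERO_OR_MORE_DIGITS.fullmatch(text) is not None


# Zero, one, two, three, and ten occurrences of the chunk.
for sample in ["", "7", "42", "123", "1234567890"]:
    print(repr(sample), matches_whole(sample))
```

Every sample prints True, including the empty string, because zero occurrences satisfy the * quantifier.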

Let’s try this out in an example using OpenOffice.org Writer.

Try It Out

Matching Zero or More Occurrences

The sample file, Parts.txt, contains a listing of part numbers that have three alphabetic characters followed by zero or more numeric digits. In our simple sample file, the maximum number of numeric digits is four, but because the * quantifier will match four occurrences, we can use it to match the sample part numbers. If there is a good reason why it is important that a maximum of four numeric digits can occur, we can express that notion by using an alternative syntax, which we will look at a little later in this chapter. Each of the part numbers in this example consists of the sequence of uppercase characters ABC followed by zero or more numeric digits:

ABC

ABC123

ABC12

ABC889

ABC8899

ABC34

We can express what we want to do as follows:

Match an uppercase A. If there is a match, attempt to match an uppercase B. If there is a match, attempt to match an uppercase C. If all three uppercase characters match, attempt to match zero or more numeric digits.

Because all the part numbers begin with the literal characters ABC, you can use the pattern

ABC[0-9]*

to match part numbers that correspond to the description in the problem definition.
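The same pattern can be tried outside a word processor. The sketch below assumes Python's re module rather than the chapter's tool, and it inlines the six sample part numbers as a string instead of reading Parts.txt from disk. re.findall plays roughly the role of the Find All button, and Python matching is case sensitive by default, which corresponds to checking Match Case.

```python
import re

# Contents of the sample file, inlined so the sketch is self-contained.
PARTS_TXT = """ABC
ABC123
ABC12
ABC889
ABC8899
ABC34
"""

# Literal ABC followed by zero or more digits.
star_matches = re.findall(r"ABC[0-9]*", PARTS_TXT)
print(star_matches)
# ['ABC', 'ABC123', 'ABC12', 'ABC889', 'ABC8899', 'ABC34']
```

All six part numbers are returned, including the bare ABC, which has zero digits after it.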


Simple Regular Expressions

1. Open OpenOffice.org Writer, and open the sample file, Parts.txt.

2. Use Ctrl+F to open the Find and Replace dialog box.

3. Check the Regular Expression check box and the Match Case check box.

4. Enter the regular expression pattern ABC[0-9]* in the Search For text box.

5. Click the Find All button, and inspect the matches that are highlighted.

Figure 3-17 shows the matches in OpenOffice.org Writer. As you can see, all of the part numbers match the pattern.

Figure 3-17

How It Works

Before we work through a couple of the matches, let’s briefly look at part of the regular expression pattern, [0-9]*. The asterisk applies to the character class [0-9], which I call a chunk.

Why does the first part number ABC match? When the regular expression engine is at the position immediately before the A of ABC, it attempts to match the next character in the part number with an uppercase A. Because the first character of the part number ABC is an uppercase A, there is a match. Next, an attempt is made to match an uppercase B. That too matches, as does an attempt to match an uppercase C. At that stage, the first three characters in the regular expression pattern have been matched. Finally, an attempt is made to match the pattern [0-9]*, which means “Match zero or more numeric characters.” Because the character after C is a newline character, there are no numeric digits. Because there are exactly zero numeric digits after the uppercase C of ABC, there is a match (of zero numeric digits). Because all components of the pattern match, the whole pattern matches.

Why does the part number ABC8899 also match? When the regular expression engine is at the position immediately before the A of ABC8899, it attempts to match the next character in the part number with an uppercase A. Because the first character of the part number ABC8899 is an uppercase A, there is a match. Next, attempts are made to match an uppercase B and an uppercase C. These too match. At that stage, the first three characters in the regular expression pattern have been matched. Finally, an attempt is made to match the pattern [0-9]*, which means “Match zero or more numeric characters.” Four numeric digits follow the uppercase C. Because there are exactly four numeric digits after the uppercase C of ABC8899, there is a match (of four numeric digits, which meets the criterion “zero or more numeric digits”). Because all components of the pattern match, the whole pattern matches.

Work through the other part numbers step by step, and you’ll find that each ought to match the pattern ABC[0-9]*.
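One way to check that step-by-step reasoning is to ask how many digits the [0-9]* chunk consumed for each part number. The sketch below is an illustration in Python, not part of the chapter's exercise. It wraps the chunk in parentheses to form a capturing group, which is an addition of mine for inspection purposes, and the helper name digits_consumed is hypothetical.

```python
import re

PART_NUMBERS = ["ABC", "ABC123", "ABC12", "ABC889", "ABC8899", "ABC34"]

# Same pattern as in the text, with the quantified chunk captured
# so that the digits it matched can be inspected.
PATTERN = re.compile(r"ABC([0-9]*)")


def digits_consumed(part: str) -> int:
    """Return how many digits [0-9]* matched after the literal ABC."""
    match = PATTERN.match(part)
    if match is None:
        raise ValueError(f"{part!r} does not start with ABC")
    return len(match.group(1))


digit_counts = {part: digits_consumed(part) for part in PART_NUMBERS}
for part, count in digit_counts.items():
    print(part, count)
```

The counts come out as 0 for ABC and 4 for ABC8899, in line with the two walk-throughs above, and as 3, 2, 3, and 2 for the remaining part numbers.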

The + Quantifier

There are many situations where you will want to be certain that a character or group of characters is present at least once but also allow for the possibility that the character occurs more than once. The + cardinality operator is designed for that situation. The + operator means “Match one or more occurrences of the chunk that precedes me.”
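A useful way to read the + quantifier is as one required occurrence of the chunk followed by zero or more further occurrences. The sketch below, written in Python as an assumption of mine rather than the chapter's tool, compares [0-9]+ with the longhand [0-9][0-9]* on a few strings.

```python
import re

PLUS = re.compile(r"[0-9]+")           # one or more digits
EXPANDED = re.compile(r"[0-9][0-9]*")  # one digit, then zero or more digits

samples = ["", "5", "55", "2005"]

# For each sample, record whether each pattern matches the whole string.
results = [
    (s, PLUS.fullmatch(s) is not None, EXPANDED.fullmatch(s) is not None)
    for s in samples
]

for sample, plus_ok, expanded_ok in results:
    print(repr(sample), plus_ok, expanded_ok)
```

Both patterns reject the empty string and accept every string with at least one digit.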

Take a look at the example with Parts.txt, but look for matches that include at least one numeric digit. You want to find part numbers that begin with the uppercase characters ABC and then have one or more numeric digits.

You can express the problem definition like this:

Match an uppercase A. If there is a match, attempt to match an uppercase B. If there is a match, attempt to match an uppercase C. If all three uppercase characters match, attempt to match one or more numeric digits.

Use the following pattern to express that problem definition:

ABC[0-9]+

Try It Out

Matching One or More Numeric Digits

1. Open OpenOffice.org Writer, and open the sample file Parts.txt.

2. Use Ctrl+F to open the Find and Replace dialog box.

3. Check the Regular Expressions and Match Case check boxes.

4. Enter the pattern ABC[0-9]+ in the Search For text box; click the Find All button; and inspect the matching part numbers that are highlighted, as shown in Figure 3-18.


Figure 3-18

As you can see, the only change from the result of using the pattern ABC[0-9]* is that the pattern ABC[0-9]+ fails to match the part number ABC.
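That difference can be confirmed programmatically. The following sketch assumes Python's re module in place of OpenOffice.org Writer and inlines the sample part numbers rather than reading Parts.txt. It runs both patterns and lists what only the * version finds.

```python
import re

# Contents of the sample file, inlined so the sketch is self-contained.
PARTS_TXT = """ABC
ABC123
ABC12
ABC889
ABC8899
ABC34
"""

star = re.findall(r"ABC[0-9]*", PARTS_TXT)  # zero or more digits
plus = re.findall(r"ABC[0-9]+", PARTS_TXT)  # one or more digits

# Part numbers matched by the * pattern but not by the + pattern.
only_star = [part for part in star if part not in plus]
print(only_star)
# ['ABC']
```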

How It Works

When the regular expression engine is at the position immediately before the uppercase A of the part number ABC, it attempts to match an uppercase A. That matches. Next, subsequent attempts are made to match an uppercase B and an uppercase C. They too match. At that stage, the first three characters in the regular expression pattern have been matched. Finally, an attempt is made to match the pattern [0-9]+, which means “Match one or more numeric characters.” There are zero numeric digits following the uppercase C. Because there are exactly zero numeric digits after the uppercase C of ABC, there is no match (zero numeric digits fails to match the criterion “one or more numeric digits,” specified by the + quantifier). Because the final component of the pattern fails to match, the whole pattern fails to match.
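The walk-through can be mimicked with a small hand-written matcher. This is a simplified sketch of the reasoning in Python, not how any real regular expression engine is implemented. It only attempts a match at the start of the string, and the function name match_part_number and its min_digits parameter are hypothetical.

```python
from typing import Optional


def match_part_number(text: str, min_digits: int) -> Optional[str]:
    """Match literal A, B, C, then a run of digits.

    min_digits=0 behaves like ABC[0-9]*; min_digits=1 like ABC[0-9]+.
    Returns the matched text, or None if the match fails.
    """
    pos = 0
    # Match the three literal characters one at a time.
    for literal in "ABC":
        if pos >= len(text) or text[pos] != literal:
            return None
        pos += 1
    # Consume as many digits as are available.
    start_of_digits = pos
    while pos < len(text) and text[pos] in "0123456789":
        pos += 1
    # The quantifier decides whether the number of digits found is enough.
    if pos - start_of_digits < min_digits:
        return None
    return text[:pos]


print(match_part_number("ABC", 0))      # ABC
print(match_part_number("ABC", 1))      # None
print(match_part_number("ABC8899", 1))  # ABC8899
```

With min_digits set to 1, the part number ABC fails at the final step, exactly where the explanation above says the + quantifier fails.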

Why does the part number ABC8899 match? When the regular expression engine is at the position immediately before the A of ABC8899, it attempts to match the next character in the part number with an uppercase A. Because the first character of the part number ABC8899 is an uppercase A, there is a match. Next, attempts are made to match an uppercase B and an uppercase C. They too match. At that stage, the first three characters in the regular expression pattern have been matched. Finally, an attempt is made to
