Beginning Regular Expressions 2005.pdf

Character Classes

How It Works

Look at how the first part number, DB992, matches. The regular expression will start matching from the position immediately before the D. First, it attempts to match the first component of the pattern, the character class [ADFG], with the first character of DB992, which is D. That matches. Next, it attempts to match the second component of the pattern, the character class [XBGE], with the second character of the test text, B. That too matches. Next, the third component of the pattern, the character class [0-9] with the quantifier {3}, attempts to match three successive numeric digits. Because the third, fourth, and fifth characters of the test text are the numeric digits 9, 9, and 2, there is a match. Because all components of the regular expression pattern match, the entire pattern matches.

If, however, you did not want to match a hypothetical part number such as AG123, these character classes would give an undesired match, because the pattern [ADFG][XBGE] matches the first two characters of AG123.
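The walkthrough above can be tried in Python's re module. This is a minimal sketch: the pattern is assembled from the three components described in the text, and the helper name matches_part_number and the extra test value DZ992 are assumptions made for illustration.

```python
import re

# Pattern assembled from the three components described above:
# one of A, D, F, G; then one of X, B, G, E; then exactly three digits.
PART_NUMBER = re.compile(r"[ADFG][XBGE][0-9]{3}")


def matches_part_number(text):
    """Return True if the pattern matches starting at the first character."""
    return PART_NUMBER.match(text) is not None


print(matches_part_number("DB992"))  # D, then B, then 992: every component matches
print(matches_part_number("AG123"))  # also matches, which may not be what you want
print(matches_part_number("DZ992"))  # Z is not in [XBGE], so the match fails
```

A character class accepts any one of its listed characters, so every combination of a first-class letter and a second-class letter is allowed, including combinations such as AG that were never intended.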

Hexadecimal Numbers

One situation when character ranges are useful is when identifying hexadecimal numbers. As you probably know, hexadecimal numbers represent numerical values to the base 16 rather than normal, decimal arithmetic where numbers are expressed to base 10. To enable display of numbers from 0 to 15 the alphabetic characters A through F or a through f (either case is acceptable) are used to represent numbers 10 through 15 using a single character. The following table shows decimal numbers and hexadecimal numbers representing 10 through 15, in case you are not familiar with the notation. Numbers from 0 through 9 in hexadecimal are represented in the same way as in decimal numbers.

Decimal    Hexadecimal
10         A or a
11         B or b
12         C or c
13         D or d
14         E or e
15         F or f

Hexadecimal numbers are natural in many computing uses because 16 is 2 to the power of 4. One situation where hexadecimal numbers are used is in defining color values in some HTML/XHTML or Scalable Vector Graphics (SVG) attribute values.
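As a quick check of the notation, Python's built-in int() can convert hexadecimal digits to decimal when given base 16. The helper name hex_pairs_to_decimal is an assumption made for this sketch; it splits a six-character color value into its three two-character hexadecimal numbers.

```python
# Each hexadecimal letter, in either case, stands for a value from 10 through 15.
for digit in "AaBbCcDdEeFf":
    print(digit, int(digit, 16))


def hex_pairs_to_decimal(color):
    """Split a value like '#DE88D9' into three two-character hex numbers."""
    digits = color.lstrip("#")
    return [int(digits[i:i + 2], 16) for i in range(0, 6, 2)]


# Two hexadecimal digits cover 0 through 255, because 16 * 16 = 256.
print(hex_pairs_to_decimal("#DE88D9"))
```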

In SVG, for example, a color value is often expressed as three successive two-character hexadecimal numbers. Each sequence of characters is written as a literal #, followed by six characters, each pair of which should be a valid hexadecimal number.


Chapter 5

Several values, some of which contain correctly written hexadecimal values and some of which do not, are contained in the file Numbers.txt, shown here:

#DE88D9

#DE88D9

#DG3399

#0099FF

#99FG00

#CCCCCC

#669933

#66330

#8i8824

#902332

#8F8F8F

#2099CC

#88CCFF

#CFE

#994488

#CFEE

Some of the sequences of characters contain characters outside the ranges 0 through 9 and A (or a) through F (or f). Some don’t contain exactly six characters or digits.

You could express the problem definition as follows:

Match a literal # character, followed by matching six successive characters, each of which can represent a hexadecimal number from 0 through 15 (decimal) — that is, 0 through F (hexadecimal).

The following regular expression pattern selects valid character sequences:

#[0-9a-fA-F]{6}

Figure 5-9 shows the result of applying the pattern to Numbers.txt.
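The same test can be sketched in Python rather than in the tool shown in the figure. Here the contents of Numbers.txt are embedded as a string so that the example is self-contained; the variable names are assumptions.

```python
import re

# The pattern from the text: a literal # followed by six hexadecimal digits.
HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}")

# Contents of Numbers.txt, one value per line.
NUMBERS = """#DE88D9
#DE88D9
#DG3399
#0099FF
#99FG00
#CCCCCC
#669933
#66330
#8i8824
#902332
#8F8F8F
#2099CC
#88CCFF
#CFE
#994488
#CFEE"""

lines = NUMBERS.splitlines()
valid = [line for line in lines if HEX_COLOR.search(line)]
invalid = [line for line in lines if not HEX_COLOR.search(line)]

print("matched:    ", valid)
print("not matched:", invalid)
```

Note that the pattern is not anchored, so it would also match the first six digits of a longer run of hexadecimal characters. That case does not occur in this sample data.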

Rather than select valid hexadecimal numbers, you might wish to select all numbers that are not valid for one reason or another. Later in this chapter, you will examine how you might approach that after the concept of negated character ranges is introduced.

IP Addresses

Another example where you might benefit from regular expressions is in using IP addresses. IP addresses are used to locate servers on the World Wide Web. When you type www.WhereIWantToGo.com in your browser, that is translated to an IP address, which takes a form such as 123.2.234.23, where you have groups of one, two, or three digits separated by period characters.

Strictly speaking, values such as 002 are allowed in IP addresses. However, leading zeros are not commonly used. For the purposes of this example, I will assume that leading zeros don’t occur in the sample data.


Figure 5-9

Describing the structure of an IP address in English, you might attempt to express the problem definition as follows:

Match between 1 and 3 numeric digits followed by a period character, followed by between 1 and 3 numeric digits, followed by a period character, followed by between 1 and 3 numeric digits, followed by a period character, followed by between 1 and 3 numeric digits.

Based on that description, a first attempt to create a regular expression pattern to identify IP addresses might look like this:

[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}

Remember that you need to use the escape sequence \. to match a period.


The sample file to test is IPLike.txt, which includes numbers that are valid IP addresses and others that are not:

12.12.12.12

255.255.256.255

12.255.12.255

256.123.256.123

8.234.88.55

196.83.83.191

8.234.88,55

88.173.71.66

241.92.88.103

Figure 5-10 shows the results of a match on IPLike.txt in OpenOffice.org Writer using the pattern just mentioned.

Only one of the strings, 8.234.88,55, in IPLike.txt fails to match the pattern because it contains a comma, whereas from the description we are using, it ought to contain a period.

However, so far, the fact that the numbers in an IP address have a maximum value of 255 has been overlooked. So although your pattern matches the fourth string in IPLike.txt, 256.123.256.123, that sequence of characters is not a valid IP address. The same is true of the second string, 255.255.256.255.

The pattern [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3} matches IP addresses but also matches things that are not valid IP addresses.
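The behavior of the first-attempt pattern can be reproduced with a short Python sketch. The contents of IPLike.txt are embedded as a list so that the example is self-contained; the variable names are assumptions.

```python
import re

# First attempt: four groups of one to three digits separated by literal periods.
IP_LIKE = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")

# Contents of IPLike.txt, one candidate per entry.
IP_LINES = [
    "12.12.12.12",
    "255.255.256.255",
    "12.255.12.255",
    "256.123.256.123",
    "8.234.88.55",
    "196.83.83.191",
    "8.234.88,55",
    "88.173.71.66",
    "241.92.88.103",
]

matched = [line for line in IP_LINES if IP_LIKE.search(line)]
unmatched = [line for line in IP_LINES if not IP_LIKE.search(line)]

print("matched:    ", matched)
print("not matched:", unmatched)
```

Only the line containing a comma is rejected. The lines containing 256 still match, because [0-9]{1,3} accepts any value from 0 through 999.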

Take a closer look at how a valid IP address is made up. First, look at the situation where you have a single-digit number. You could describe those as being “a single numeric digit between 0 and 9.” That fits neatly with the following character class:

[0-9]

That character class matches numbers such as 1, 3, 4, 6, and 9, which is what you want.

In the next scenario you have numbers with two digits. Those are in the range 10 to 99. So again, a character class can be used to express each of the two digits, like this:

[1-9][0-9]

You use the character class [1-9], not [0-9], for the first of the two character classes because numbers less than 10 are written as single-digit numbers and are covered by the previously defined character class, [0-9]. However, the second digit of a two-digit number can be 0, such as in 10 or 50, or 9, such as in 29 or 79, so the character class [0-9] is appropriate for the second of the two digits.


Figure 5-10

You have found patterns made up from character classes that match one-digit or two-digit values contained in IP addresses. Now look at a situation where one of the numeric values is a three-digit number. For clarity, this explanation is split into two parts. The first is creating a pattern that matches numbers from 100 to 199; the second is creating a pattern that matches numbers from 200 to 255.

First, a pattern to match numbers from 100 to 199 is shown here:

1[0-9][0-9]

You know that all the numbers you are interested in begin with a 1, so you can include it literally as the first character in the pattern. The second digit can be anything from 0, as in 103 or 106, to 9, as in 191 or 197, so the character class [0-9] is appropriate for the second digit. Similarly, the third digit can be from 0 to 9, so the character class [0-9] is appropriate.

Next, look at the situation where you have three-digit values in the range from 200 to 249. The following pattern matches those values:

2[0-4][0-9]


The first character in the pattern, 2, is the same for all values in that range, such as 202, 226, and 241. The second character in the pattern is a numeric digit that can be represented by the character class [0-4], and the third character in the pattern can be represented by the character class [0-9], as in 203, 228, or 249.

Next, you need a pattern to match values in the range 250 to 255. The following pattern matches those values:

25[0-5]

The first character is always a 2, so it can be written literally. The second character in the pattern is always a 5, so it, too, can be written literally. The third character can be represented by the character class [0-5], which matches 250, 253, and 255, for example.

So let’s bring all that together. To match numbers from 0 to 255, the number can match any of the following patterns:

[0-9]

when it’s a single-digit value, or:

[1-9][0-9]

when it’s a value between 10 and 99, or:

1[0-9][0-9]

when it’s a value between 100 and 199, or:

2[0-4][0-9]

when it’s between 200 and 249, or:

25[0-5]

when it’s between 250 and 255.

If you try to express that in a problem definition, you have a definition that is considerably longer than any you have seen so far:

Match any of the following:

1. If the numeric value is a single digit, match any number from 0 through 9.

2. If the numeric value is a two-digit number, match 1 through 9 for the first character and 0 through 9 for the second character.

3. If the numeric value is a three-digit number, it matches if it matches any of the following:

a. Match the numeric digit 1, followed by a numeric digit 0 through 9, followed by a numeric digit 0 through 9, or

b. Match the numeric digit 2, followed by a numeric digit 0 through 4, followed by a numeric digit 0 through 9, or

c. Match the numeric digit 2, followed by a numeric digit 5, followed by a numeric digit 0 through 5.

To put the preceding five patterns together, you need to use parentheses, and because the patterns are mutually exclusive options, use the pipe character, |, to separate one option from another. So to match a single value from 0 to 255 and only values in that range, you have the following pattern:

([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])
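One way to check the combined pattern is to test it against one value at a time. The sketch below uses Python's re.fullmatch, which succeeds only if the entire string matches. That is an assumption of this sketch and is stricter than searching within a longer text, as a Find dialog does. The helper name is_octet is also an assumption.

```python
import re

# The combined pattern from the text: five mutually exclusive options.
OCTET = re.compile(r"([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])")


def is_octet(text):
    """Return True if the whole string is a value from 0 through 255."""
    return OCTET.fullmatch(text) is not None


# Every value from 0 through 255 is accepted when tested in isolation.
print(all(is_octet(str(n)) for n in range(256)))

# Values outside the range, or written with leading zeros, are rejected.
for candidate in ["256", "300", "999", "007"]:
    print(candidate, is_octet(candidate))
```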

Imagine coming back to this six months later, or maintaining someone else’s code if there is no documentation that describes what he or she intended to do. Ideally, you can see the value of good documentation for problems like this.

This has been a bit more complicated than anything tackled so far in this book, so for further practice, try it out in OpenOffice.org Writer.

Try It Out

Attempting to Match a Numeric Value up to 255

1. Open OpenOffice.org Writer, and open the test file IPLike.txt.

2. Open the Find & Replace dialog box, using Ctrl+F, and check the Regular Expressions and Match Case check boxes.

3. Enter the pattern ([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]) in the Search For text box. Be careful not to include any whitespace inside the parentheses.

4. Click the Find All button, and inspect the results, as shown in Figure 5-11. You may be surprised to see that the value 256 is still matched.

How It Works

When you look closely at Figure 5-11, you will see matches on the value 256 in both the second and fourth lines. Because you’ve just spent quite a bit of time carefully crafting a regular expression that matches values only up to 255, what is happening?

For a hint about what is happening, let’s switch to the Komodo Regular Expression Toolkit. Apply the following pattern to the test value 256:

([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])

125

Chapter 5

Figure 5-11

Figure 5-12 shows the results.

The first digit, 2, of 256 is matched. In other words, the 2 of 256 matches the first option inside the parentheses. This provides a clue as to what is happening in OpenOffice.org Writer. The first option in the pattern, the character class [0-9], matches the 2 of 256 in IPLike.txt, and that same pattern, from the available pattern options inside the parentheses, also matches the 5 of 256 and the 6 of 256. So in OpenOffice.org Writer, when you seem to have a match for 256, you actually have three separate matches: one for 2, one for 5, and one for 6. The way those three consecutive characters are highlighted makes it look as though there is a problem with the regular expression pattern when, in fact, that isn’t the problem.
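The three separate matches can be made visible with Python's re.finditer, used here in place of the tools mentioned in the text. Python tries the options inside the parentheses from left to right and takes the first one that succeeds, which is the behavior described above.

```python
import re

# Same pattern as above, with [0-9] as the first option.
OCTET = re.compile(r"([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])")

# Collect each separate match found in the test value 256.
pieces = [m.group(0) for m in OCTET.finditer("256")]
print(pieces)  # ['2', '5', '6']
```

Because [0-9] is tried first and always succeeds on a digit, none of the longer options is ever used with this ordering.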

What happens if you modify the ordering of the options inside the parentheses? First, modify the regular expression pattern so that the [0-9] is moved to the end:

([1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]|[0-9])

If you try to edit the regular expression pattern inside the Find & Replace dialog box of OpenOffice.org Writer 1.1, you may run into some bugs in the editor, which can lead to overwriting of existing characters or a failure to accept new characters. If you do find that you can’t edit correctly inside the Search For text box, I suggest that you edit the pattern elsewhere, such as in Notepad, for example, and paste the edited pattern into the Search For text box.


Figure 5-12

If you then click the Find All button, the values 256 in the sample text are still highlighted. If you take time to use the Find button (rather than the Find All button) to step through the matches, you will see that first 25 is matched (by the [1-9][0-9] pattern, which is now the first option inside the parentheses) and then 6 is matched (by the [0-9] character class, which is now the last option inside the parentheses).

Even if you reverse the order of all the options inside the parentheses, all occurrences of 256 are matched:

(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])
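The effect of reordering the options can be confirmed with a short Python sketch. The variable names are assumptions; the two patterns are the reordered versions shown above.

```python
import re

# [0-9] moved to the end of the options.
SINGLE_DIGIT_LAST = re.compile(r"([1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]|[0-9])")

# All five options in reverse order.
REVERSED = re.compile(r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])")


def pieces(pattern, text):
    """Return the text of each separate match found by the pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


print(pieces(SINGLE_DIGIT_LAST, "256"))      # ['25', '6']
print(pieces(REVERSED, "256"))               # ['25', '6']
print(pieces(REVERSED, "255.255.256.255"))   # ['255', '255', '25', '6', '255']
```

With either ordering, 256 is still consumed as two matches, 25 and then 6, so reordering alone cannot exclude it.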

What you need to be able to take this example to a successful conclusion is a regular expression pattern that allows you to specify that all characters up to the first period must match in a single chunk, that the digits between two periods must match as a single chunk, and the digits between the final period and the end of the string also match in a single chunk. You will return to this problem in Chapter 6 after you have looked at the meaning and usage of the ^ and $ metacharacters.

Just in case you want to try out a solution right now, here is a pattern that will do what you want — that is, it will not match any IP-like sequence of characters that contain values of 256 or more.

^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$

The way the preceding pattern works is explained in Chapter 6.
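As a preview, the anchored pattern can be applied to each line of IPLike.txt in Python. This is a minimal sketch: the file contents are embedded as a list, each entry is tested on its own so that ^ and $ refer to the start and end of that entry, and the variable names are assumptions.

```python
import re

# The anchored pattern from the text, written on a single line.
VALID_IP = re.compile(
    r"^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}"
    r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$"
)

# Contents of IPLike.txt, one candidate per entry.
IP_LINES = [
    "12.12.12.12",
    "255.255.256.255",
    "12.255.12.255",
    "256.123.256.123",
    "8.234.88.55",
    "196.83.83.191",
    "8.234.88,55",
    "88.173.71.66",
    "241.92.88.103",
]

accepted = [line for line in IP_LINES if VALID_IP.search(line)]
rejected = [line for line in IP_LINES if not VALID_IP.search(line)]

print("accepted:", accepted)
print("rejected:", rejected)
```

The two lines containing 256 and the line containing a comma are now rejected, because each group of digits must match as a whole between the anchors and the periods.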
