Beginning Regular Expressions 2005.pdf

Chapter 5

Reverse Ranges in Character Classes

The ranges you have looked at so far follow alphabetic or numeric order. However, it is possible, at least using some tools, to write ranges that are in reverse alphabetic or numeric order, for which I use the term reverse ranges.

When you use character ranges inside a character class, you must be careful if you attempt to use a reverse range because different products and languages are inconsistent in how they handle that syntax.

For example, examine the following regular expression:

[t-r]ight

You might expect OpenOffice.org Writer to interpret the character class as matching r, s, or t as the initial character of the desired character sequence. Instead, it treats the t and the r as literal characters in the character class and ignores the hyphen. You can see this in action in Figure 5-13. Notice that the character sequence -ight, which is included in the file Light2.txt, is not selected by the regular expression pattern.

Figure 5-13

Character Classes

However, in PowerGrep, the regular expression pattern [t-r]ight won’t compile and produces the error shown in Figure 5-14.

Figure 5-14

There is, typically, no advantage in attempting to use reverse ranges in character classes, and I suggest that you avoid using these.
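As a further illustration of this inconsistency, the following minimal sketch shows how one more tool treats the same pattern. Python's re module is not one of the tools discussed in this chapter and is used here only as an assumed example; the helper name try_compile and the sample text are hypothetical. Like PowerGrep, Python refuses to compile the reverse range.

```python
import re


def try_compile(pattern):
    """Return the compiled pattern, or None if Python's re module rejects it."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        print(f"{pattern!r} did not compile: {exc}")
        return None


# Reverse range: Python raises re.error, so try_compile returns None.
reverse = try_compile(r"[t-r]ight")

# Forward range: matches r, s, or t followed by "ight".
forward = try_compile(r"[r-t]ight")

sample = "right sight tight light -ight"  # hypothetical sample text
matches = forward.findall(sample)
print(matches)  # → ['right', 'sight', 'tight']
```

Writing the range in forward order, [r-t], gives the same intended meaning and avoids depending on how a particular tool handles reverse ranges.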

A Potential Range Trap

Suppose that you want to allow for different separators in dates occurring in a document or set of documents. Among the issues this problem throws up is a possible trap in expressing character ranges.

As a first test document, we will use Dates.txt, shown here:

2004-12-31
2001/09/11
2003.11.19
2002/04/29
2000/10/19
2005/08/28
2006/09/18

As you can see, in this file the dates are in YYYY/MM/DD format, but sometimes the dates use the hyphen as a separator, sometimes the forward slash, and sometimes the period. Your task is to select all occurrences of sequences of characters that represent dates (assume for this example that dates are expressed only using digits and separators and are not expressed using names of months, for example).

So if you wanted to select all dates, whether they use hyphens, forward slashes, or periods as separators, you might try a regular expression pattern like this:

(20|19)[0-9]{2}[.-/][01][0-9][.-/][0123][0-9]

In the character class [.-/], which you attempt to use to match the separator, the sequence of characters (period followed by hyphen followed by forward slash) is interpreted as the range from the period to the forward slash. However, as you can see in the top row of Figure 5-15, the hyphen is U+002D, which lies outside that range, and the period (U+002E) is the character immediately before the forward slash (U+002F). So, undesirably, the character class [.-/] specifies a range that contains only the period and forward-slash characters, and a date that uses the hyphen as a separator, such as 2004-12-31, is not matched.
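You can check the trap for yourself with a short script. This is a minimal sketch that assumes Python's re module, which is not one of the tools used in this chapter; the variable names are illustrative only, and the three dates are taken from Dates.txt.

```python
import re

# Three of the sample dates from Dates.txt, one per separator.
dates = ["2004-12-31", "2001/09/11", "2003.11.19"]

# [.-/] is read as the range U+002E through U+002F: period and forward slash only.
separator_range = re.compile(r"[.-/]")
print(separator_range.findall("-./"))  # → ['.', '/']

# The faulty pattern therefore fails on the hyphen-separated date.
faulty = re.compile(r"(20|19)[0-9]{2}[.-/][01][0-9][.-/][0123][0-9]")
missed = [d for d in dates if not faulty.fullmatch(d)]
print(missed)  # → ['2004-12-31']
```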

Figure 5-15

Characters can be expressed using Unicode numeric references. The period is U+002E; uppercase A is U+0041. The Windows Character Map shows this syntax for characters if you hover the mouse over characters of interest.
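If you do not have the Windows Character Map at hand, the same numeric references can be printed from a script. The following is a minimal sketch assuming Python; the function name code_point is hypothetical.

```python
def code_point(ch):
    """Format a single character as a Unicode numeric reference such as U+002E."""
    return f"U+{ord(ch):04X}"


# The three candidate date separators, plus uppercase A for comparison.
for ch in "-./A":
    print(ch, code_point(ch))
# Prints:
# - U+002D
# . U+002E
# / U+002F
# A U+0041
```

The output confirms that the hyphen sorts immediately before the period, so it falls just outside a range that starts at the period.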

To use the hyphen without creating a range, the hyphen should be the first character in the character class:

[-./]

This gives a pattern that will match each of the sample dates in the file Dates.txt:

(20|19)[0-9]{2}[-./][01][0-9][-./][0123][0-9]
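To confirm that the corrected pattern selects every sample date, you could run a check like the following. It is a minimal sketch assuming Python's re module rather than the tools used in this chapter, with the content of Dates.txt inlined as a string; the variable names are illustrative.

```python
import re

# The sample dates from Dates.txt, inlined so that the example is self-contained.
dates_txt = (
    "2004-12-31 2001/09/11 2003.11.19 2002/04/29 "
    "2000/10/19 2005/08/28 2006/09/18"
)

# The hyphen comes first in the character class, so it is a literal hyphen.
date_pattern = re.compile(r"(20|19)[0-9]{2}[-./][01][0-9][-./][0123][0-9]")

# finditer with group(0) returns the whole match rather than only the (20|19) group.
found = [m.group(0) for m in date_pattern.finditer(dates_txt)]
print(len(found))  # → 7
print(found[0])    # → 2004-12-31

# In Python's re module, a hyphen placed last or escaped is also taken literally.
hyphen_last = re.findall(r"[./-]", "-./")
hyphen_escaped = re.findall(r"[.\-/]", "-./")
```

Whether the last-position and escaped forms behave the same way in other tools is something to verify per tool; placing the hyphen first is the form recommended here.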

Try It Out

Matching Dates

1. Open PowerGrep, and enter the regular expression pattern (20|19)[0-9]{2}[-./][01][0-9][-./][0123][0-9] in the Search text box.

2. Enter C:\BRegExp\Ch05 in the Folder: text box, assuming that you have saved the Chapter 5 files from the download in that directory.

3. Enter Dates.txt in the File Mask text box.

4. Click the Search button, and inspect the results shown in Figure 5-16. Notice particularly that the first match, 2004-12-31, includes a hyphen, confirming that the regular expression pattern works as desired.

Figure 5-16

How It Works

The first part of the pattern, (20|19), allows a choice of 20 or 19 as the first two characters of the sequence of characters being tested. Next, the pattern [0-9]{2} matches two successive numeric digits in the range 0 through 9. Next, the character class pattern [-./] matches a single character, which is a hyphen, a period, or a forward slash.
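The components described so far can be laid out one per line. The following minimal sketch assumes Python's re module and its re.VERBOSE flag, which allows whitespace and comments inside a pattern; neither is part of the PowerGrep example. The comments on the month and day components are read directly from the pattern itself.

```python
import re

date_pattern = re.compile(
    r"""
    (20|19)      # a choice of 20 or 19 as the first two characters
    [0-9]{2}     # two successive digits, each in the range 0 through 9
    [-./]        # one separator: hyphen, period, or forward slash
    [01][0-9]    # month: 0 or 1, then any digit
    [-./]        # second separator
    [0123][0-9]  # day: 0, 1, 2, or 3, then any digit
    """,
    re.VERBOSE,
)

m = date_pattern.fullmatch("2004-12-31")
print(m.group(0))  # → 2004-12-31
print(m.group(1))  # → 20
```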
