Beginning Regular Expressions 2005.pdf

Regular Expressions in Java

    System.out.println("The test string was: " + testString);
    System.out.println("The regular expression pattern was: " + myRegex);
    while (myMatcher.find())
    {
      myMatch = myMatcher.group();
      System.out.println("Match found: " + myMatch);
    } // end while
    if (myMatch == null){
      System.out.println("There were no matches.");
    } // end if
    return true;
  } // findMatches()
}

2. Save the code as CharClassSubtraction.java.

3. Compile the code. At the command line, type javac CharClassSubtraction.java.

4. Run the code. At the command line, type java CharClassSubtraction "A1 B2 H3 I2 J4 M5 N6", and inspect the results, as shown in Figure 25-11. Notice that H3, I2, J4, and M5 are not matched.

Figure 25-11

How It Works

On this occasion, the regular expression pattern is [A-Z&&[^H-M]]\d. The && operator inside the character class takes the intersection of two character classes, in this case [A-Z] and [^H-M]. The intersection is the set of uppercase letters outside the range H through M, so [A-Z&&[^H-M]] is equivalent to [A-GN-Z]:

String myRegex = "[A-Z&&[^H-M]]\\d";

The character sequences H3, I2, J4, and M5 fail to match because the letters H through M are excluded from the combined character class.
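The equivalence of [A-Z&&[^H-M]] and [A-GN-Z] can be checked with a short sketch like the one below. It is not part of the chapter's listing: the class name SubtractionSketch and the helper findAll are hypothetical names chosen for illustration, and the test string is the one used in step 4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the chapter's CharClassSubtraction listing.
class SubtractionSketch {

    // Collects every match of regex in input, in order of appearance.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String testString = "A1 B2 H3 I2 J4 M5 N6";
        // Intersection form used in the chapter.
        System.out.println(findAll("[A-Z&&[^H-M]]\\d", testString)); // [A1, B2, N6]
        // Hand-expanded form; it lists the same matches.
        System.out.println(findAll("[A-GN-Z]\\d", testString));      // [A1, B2, N6]
    }
}
```

Both calls print the same list, which is what the claimed equivalence predicts for this input.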

The POSIX Character Classes in the java.util.regex Package

The Java java.util.regex package supports several POSIX character classes but uses a syntax different from the one you have seen in OpenOffice.org, for example. The java.util.regex syntax for POSIX character classes resembles in some respects the syntax used in W3C XML Schema for Unicode character classes and character blocks. The following table lists the POSIX character classes supported in the java.util.regex package.


Chapter 25

Metacharacter   Description

\p{Lower}       Equivalent to the character class [a-z].
\p{Upper}       Equivalent to the character class [A-Z].
\p{ASCII}       Matches all ASCII characters. Equivalent to U+0000 through U+007F.
\p{Alpha}       Matches any alphabetic character. Equivalent to either the [\p{Upper}\p{Lower}] or [A-Za-z] character classes.
\p{Digit}       Equivalent to the character class [0-9].
\p{Punct}       Equivalent to the character class [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~].
\p{Graph}       Visible characters. Equivalent to the character class [\p{Alnum}\p{Punct}], where \p{Alnum} is [\p{Alpha}\p{Digit}].
\p{Print}       Printable characters. Equivalent to the character class [\p{Graph}\x20].
\p{Blank}       A space character or tab character.
\p{Cntrl}       A control character. Equivalent to the character class [\x00-\x1F\x7F].
\p{XDigit}      A hexadecimal digit. Equivalent to the character class [0-9a-fA-F].
\p{Space}       A whitespace character. Equivalent to the character class [ \t\n\x0B\f\r].
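As a rough illustration of the table, the following sketch tries a few of the POSIX classes against short strings. The class name PosixClassSketch and the helper matchesAll are hypothetical, and the sketch assumes the default matching mode, in which these classes cover ASCII characters only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it assumes default flags, so the POSIX classes are ASCII-only.
class PosixClassSketch {

    // True when the whole input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // One uppercase letter followed by lowercase letters.
        System.out.println(matchesAll("\\p{Upper}\\p{Lower}+", "Java"));   // true
        // Hexadecimal digits include A-F; decimal digits do not.
        System.out.println(matchesAll("\\p{XDigit}+", "00EA"));            // true
        System.out.println(matchesAll("\\p{Digit}+", "00EA"));             // false
        // Letters followed by one punctuation character.
        System.out.println(matchesAll("\\p{Alpha}+\\p{Punct}", "regex!")); // true
        // Visible characters include both letters and digits.
        System.out.println(matchesAll("\\p{Graph}+", "A1"));               // true
        // e with a circumflex is lowercase but lies outside [a-z].
        System.out.println(matchesAll("\\p{Lower}", "\u00EA"));            // false
    }
}
```

Note that each backslash in the table is doubled inside a Java string literal, as in the chapter's own listings.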

 

 

Unicode Character Classes and Character Blocks

Strings in Java are sequences of Unicode characters. Each char value is a 16-bit (2-byte) UTF-16 code unit; characters outside the Basic Multilingual Plane are represented by a pair of char values. If you are unfamiliar with how English-language characters and other characters map to Unicode code points, the Windows Character Map utility can be useful. Figure 25-12 shows e with a circumflex selected in Character Map. Notice in the lower left of the figure that the character can be expressed as U+00EA. In Java, you would write that as \u00EA.

To match characters in the Basic Latin character block, for example, use a lowercase p as in the pattern \p{InBasicLatin}.

To match characters not in the Basic Latin character block, use an uppercase P as in the pattern \P{InBasicLatin}.
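A minimal sketch of the two patterns follows, using the plain letter e and the e with a circumflex (U+00EA) mentioned earlier. The class name BasicLatinSketch and its two helper methods are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the InBasicLatin block patterns.
class BasicLatinSketch {

    // True when the single character in s lies in the Basic Latin block.
    static boolean isBasicLatin(String s) {
        return Pattern.matches("\\p{InBasicLatin}", s);
    }

    // True when the single character in s lies outside the Basic Latin block.
    static boolean isNotBasicLatin(String s) {
        return Pattern.matches("\\P{InBasicLatin}", s);
    }

    public static void main(String[] args) {
        String plainE = "e";
        String eCircumflex = "\u00EA"; // e with a circumflex, U+00EA
        System.out.println(isBasicLatin(plainE));          // true
        System.out.println(isBasicLatin(eCircumflex));     // false
        System.out.println(isNotBasicLatin(eCircumflex));  // true
        // The same code point can also be written as an escape inside the pattern.
        System.out.println(Pattern.matches("\\u00EA", eCircumflex)); // true
    }
}
```

The Basic Latin block covers U+0000 through U+007F, so U+00EA falls outside it and is matched only by the uppercase P form.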

Full information about Unicode is located at www.unicode.org. At the time of this writing, the current version of the Unicode Standard is version 4.0.1. Further information about the Unicode Standard is located at www.unicode.org/standard/standard.html.


Figure 25-12

Using Escaped Characters

To match characters that are used for special purposes as regular expression metacharacters, it is necessary to escape such characters. The following table lists some of the more commonly used escaped characters in Java.
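Before turning to the table, here is a hedged sketch of how escaping looks in Java source, where each regex backslash must itself be doubled inside a string literal. The class name EscapeSketch and the helper matchesAll are hypothetical. The sketch also uses Pattern.quote, a java.util.regex method that this section does not cover, as an alternative way to treat a whole string literally.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; Pattern.quote is an addition not discussed in this section.
class EscapeSketch {

    // True when the whole input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unescaped, * is a quantifier, so the pattern 2*3 does not match the text "2*3".
        System.out.println(matchesAll("2*3", "2*3"));          // false
        // Escaped, \* matches a literal asterisk.
        System.out.println(matchesAll("2\\*3", "2*3"));        // true
        // Escaped parentheses and dollar sign match themselves.
        System.out.println(matchesAll("f\\(x\\)", "f(x)"));    // true
        System.out.println(matchesAll("\\$5", "$5"));          // true
        // A literal backslash needs four backslashes in Java source.
        System.out.println(matchesAll("C:\\\\temp", "C:\\temp")); // true
        // Pattern.quote escapes an entire string at once.
        System.out.println(matchesAll(Pattern.quote("a.b"), "a.b")); // true
        System.out.println(matchesAll(Pattern.quote("a.b"), "axb")); // false
    }
}
```

The first two lines show why escaping matters: the same three characters behave as a quantified pattern or as literal text depending on the backslash.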

Escaped Character Sequence   Matches

\\                           \ (the backslash)
\(                           ( (opening parenthesis)
\)                           ) (closing parenthesis)
\[                           [ (opening square bracket)
\]                           ] (closing square bracket)
\^                           ^ (caret); used only outside a character class
\$                           $ (dollar)
\?                           ? (question mark)
\*                           * (asterisk)

 

Table continued on following page
