Beginning Regular Expressions 2005.pdf

Visual Basic .NET and Regular Expressions

If you want to specify multiple options, you must separate the options with the word Or. So to specify case-insensitive matching that matches from right to left, you could use code such as the following:

myMatchCollection = Regex.Matches(inputString, myPattern, _
    RegexOptions.IgnoreCase Or RegexOptions.RightToLeft)
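The fragment above can be placed in a small console module to see the combined options at work. This is only a sketch: the module name, the sample input, and the pattern are assumptions made for illustration and are not part of the original example.

```vbnet
Imports System.Text.RegularExpressions

Module CombineOptionsSketch
    Sub Main()
        ' Hypothetical sample data, chosen only for illustration.
        Dim inputString As String = "Cat cat CAT"
        Dim myPattern As String = "cat"

        ' Or combines the two flag values into a single RegexOptions value.
        Dim myOptions As RegexOptions = _
            RegexOptions.IgnoreCase Or RegexOptions.RightToLeft

        Dim myMatchCollection As MatchCollection = _
            Regex.Matches(inputString, myPattern, myOptions)

        ' IgnoreCase lets all three words match; RightToLeft makes the scan
        ' start at the end of the string, so the last word is listed first.
        For Each m As Match In myMatchCollection
            Console.WriteLine(m.Value & " at index " & m.Index)
        Next
    End Sub
End Module
```

With this input, the loop should print CAT at index 8, then cat at index 4, then Cat at index 0.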

Multiline Matching: The Effect on the ^ and $ Metacharacters

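The ^ metacharacter normally matches the position before the first character at the beginning of a string, and the $ metacharacter normally matches the position after the last character at the end of a string. In the .NET Framework, $ also matches immediately before a newline character that is the final character of the string.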

When multiline matching is used, the ^ metacharacter matches the position before the first character at the beginning of each line, and the $ metacharacter matches the position after the last character on each line.
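The difference can be seen by counting matches with and without the option. The following is a minimal sketch; the module name and the three-line test string are assumptions made for illustration.

```vbnet
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module MultilineSketch
    Sub Main()
        ' Hypothetical three-line input; vbLf is a single line-feed character.
        Dim inputString As String = "one" & vbLf & "two" & vbLf & "three"
        Dim myPattern As String = "^\w+$"

        ' Default behavior: ^ and $ anchor to the whole string, and \w+ cannot
        ' span the line feeds, so no match is found.
        Console.WriteLine(Regex.Matches(inputString, myPattern).Count)   ' 0

        ' With Multiline, ^ and $ anchor to each line, so each line matches.
        Console.WriteLine( _
            Regex.Matches(inputString, myPattern, RegexOptions.Multiline).Count)   ' 3
    End Sub
End Module
```

The sketch deliberately separates lines with a line feed only. In .NET, $ matches before a line feed but not before a carriage return, so input that uses carriage return and line feed pairs would leave a carriage return in front of each $ position and this particular pattern would not match those lines.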

Inline Documentation Using the IgnorePatternWhitespace Option

The IgnorePatternWhitespace option allows inline comments to be created that spell out the meaning of each part of the regular expression pattern.

Normally, when a regular expression pattern is matched, any whitespace in the pattern is significant. For example, a space character in the pattern is interpreted as a character to be matched. As a result of setting the IgnorePatternWhitespace option, all whitespace contained in the pattern is ignored, including space characters and newline characters. This allows a single pattern to be laid out to aid readability, to allow comments to be added, and to aid in maintenance of the regular expression pattern. To match a whitespace character, you can use the \s metacharacter.
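A short sketch may make the effect on whitespace concrete. The module name, the pattern, and the test strings here are assumptions chosen for illustration.

```vbnet
Imports System.Text.RegularExpressions

Module PatternWhitespaceSketch
    Sub Main()
        ' Hypothetical pattern containing a literal space between the parts.
        Dim spacedPattern As String = "[A-Z] \d"

        ' Default: the space in the pattern must match a space in the input.
        Console.WriteLine(Regex.IsMatch("A 1", spacedPattern))   ' True
        Console.WriteLine(Regex.IsMatch("A1", spacedPattern))    ' False

        ' IgnorePatternWhitespace: the space in the pattern is discarded,
        ' so the pattern behaves like [A-Z]\d.
        Console.WriteLine(Regex.IsMatch("A1", spacedPattern, _
            RegexOptions.IgnorePatternWhitespace))               ' True

        ' \s still matches whitespace in the input when the option is set.
        Console.WriteLine(Regex.IsMatch("A 1", "[A-Z] \s \d", _
            RegexOptions.IgnorePatternWhitespace))               ' True
    End Sub
End Module
```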

In Visual Basic .NET, the syntax for adding inline comments is a little cumbersome. If you wanted to use the myRegex variable to match an uppercase alphabetic character followed by a numeric digit, you might typically write the following:

Dim myRegex = New Regex("[A-Z]\d")

However, to use the IgnorePatternWhitespace option to specify the same regular expression pattern and include comments inline, you must write something like the following:

Dim myRegex = New Regex( _
    "[A-Z] (?# A character class to match an uppercase alphabetic character)" & _
    "\d    (?# followed by a numeric digit)", _
    RegexOptions.IgnorePatternWhitespace)

The inline comments are preceded by the character sequence (?# and followed by a ) character. The Visual Basic .NET concatenation character, &, is used between the components of the pattern, and the line-continuation character (the underscore) is used to indicate that a statement is being continued on the following line.


Chapter 21

Try It Out

Using the IgnorePatternWhitespace Option

This example matches a U.S. Social Security number. The code contained in Module1.vb in the IgnorePatternWhitespaceDemo project is shown here:

Imports System.Text.RegularExpressions

Module Module1

    ' The pattern is assembled from commented fragments. The
    ' IgnorePatternWhitespace option discards the padding spaces.
    Dim myRegex = _
        New Regex _
        ("^     (?# match the position before the first character)" & _
        "\d{3} (?# Three numeric digits, followed by)" & _
        "-     (?# a literal hyphen)" & _
        "\d{2} (?# then two numeric digits)" & _
        "-     (?# then a literal hyphen)" & _
        "\d{4} (?# then four numeric digits)" & _
        "$     (?# match the position after the last character)", _
        RegexOptions.IgnorePatternWhitespace)

    Sub Main()
        Console.WriteLine("Enter a string on the following line:")
        Dim inputString = Console.ReadLine()
        Dim myMatch = myRegex.Match(inputString)
        ' A nonzero length means that a match was found.
        If myMatch.ToString.Length Then
            Console.WriteLine("The match, '" & myMatch.Value & "' was found.")
        Else
            Console.WriteLine("There was no match")
        End If
        Console.WriteLine("Press Return to close this application.")
        Console.ReadLine()
    End Sub

End Module

1. Create a new console application project in Visual Studio 2003. Name the project IgnorePatternWhitespaceDemo.

2. Edit the code in the code window so that it matches the code in the preceding Module1.vb file. Save the code, and press F5 to run it.

3. At the command line, enter the test string 123-12-1234. Press Return, and inspect the results, as shown in Figure 21-11. Notice that there is a successful match.

Figure 21-11

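4. In Visual Studio, press F5 to run the code again.

5. At the command line, enter the test string A123-12-1234. Press Return, and inspect the results, as shown in Figure 21-12. This time there is no match, because the ^ metacharacter requires the three numeric digits to come at the very beginning of the string.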


Figure 21-12

How It Works

The pattern to be matched, if written on a single line, is as follows:

^\d{3}-\d{2}-\d{4}$

You may recognize this as a simple pattern that will match a valid U.S. Social Security number (SSN).

In this example, the pattern is written with inline comments when the myRegex variable is dimensioned. As you can see, it is much more complex to write the pattern in this way, but the inline comments make it easier for you or another developer to work out precisely what the pattern was intended to do.

The Visual Basic .NET syntax of (?# ... ) for inline comments is less clean than the simple # construct in Visual C# .NET, for example. I find that the Visual Basic .NET syntax tends to get in the way of readability of the comments. Lining up the left parentheses on each line helps maximize readability in Visual Basic .NET:

New Regex _
("^     (?# match the position before the first character)" & _
"\d{3} (?# Three numeric digits, followed by)" & _
"-     (?# a literal hyphen)" & _
"\d{2} (?# then two numeric digits)" & _
"-     (?# then a literal hyphen)" & _
"\d{4} (?# then four numeric digits)" & _
"$     (?# match the position after the last character)", _
RegexOptions.IgnorePatternWhitespace)

The pattern is equivalent to ^\d{3}-\d{2}-\d{4}$ and so matches an SSN. Therefore, when the test string is 123-12-1234, there is a match, as indicated in Figure 21-11. This is under control of the If statement in the following code. When the Length property is not 0, a match has been found, so the myMatch variable’s Value property contains the matching sequence of characters:

If myMatch.ToString.Length Then
    Console.WriteLine("The match, '" & myMatch.Value & "' was found.")

When the Length property of myMatch.ToString is 0, no match has been found, and a message indicating that is output in the Else clause:

Else
    Console.WriteLine("There was no match")
End If
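The Match class also exposes a Success property, which states the outcome directly instead of inferring it from the length of the matched text. The following sketch is an alternative formulation, not the code from the example project; the module name and the loop over two test strings are assumptions made for illustration.

```vbnet
Imports System.Text.RegularExpressions

Module MatchSuccessSketch
    Sub Main()
        Dim myRegex As New Regex("^\d{3}-\d{2}-\d{4}$")

        ' The two test strings used in the steps above.
        For Each candidate As String In New String() {"123-12-1234", "A123-12-1234"}
            Dim myMatch As Match = myRegex.Match(candidate)
            ' Success is True when the pattern matched, whatever the length.
            If myMatch.Success Then
                Console.WriteLine("The match, '" & myMatch.Value & "' was found.")
            Else
                Console.WriteLine("There was no match")
            End If
        Next
    End Sub
End Module
```

Testing Success also distinguishes a successful zero-length match from a failed match. The Length test cannot tell these apart, although that situation does not arise with this particular pattern.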

Right to Left Matching: The RightToLeft Option

When using English, the normal progression of characters along a line is from left to right. In some other languages, the progression of characters is from right to left. To support use of regular expressions in such languages, the .NET Framework provides the functionality to conduct matching from right to left. Unfortunately, my experience and that of others is that when using the RightToLeft option, the matching behavior is not fully reliable.
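For a simple pattern, the effect of the option on the scan direction can be observed as follows. This is a minimal sketch; the module name, input string, and pattern are assumptions made for illustration, and more complex patterns may not behave as predictably.

```vbnet
Imports System.Text.RegularExpressions

Module RightToLeftSketch
    Sub Main()
        ' Hypothetical input containing three candidate matches.
        Dim inputString As String = "a1 b2 c3"
        Dim myPattern As String = "[a-z]\d"

        ' Default scan direction: the first match found is the leftmost one.
        Console.WriteLine(Regex.Match(inputString, myPattern).Value)   ' a1

        ' RightToLeft: the scan starts at the end of the input, so the first
        ' match found is the rightmost one.
        Console.WriteLine(Regex.Match(inputString, myPattern, _
            RegexOptions.RightToLeft).Value)                           ' c3
    End Sub
End Module
```

The pattern itself is still written in the usual order, a letter followed by a digit. Only the position at which the scan begins, and the direction in which it proceeds, change.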


The Metacharacters Supported in Visual Basic .NET

Visual Basic .NET has perhaps a more complete and extensive regular expressions implementation than any of the tools you have seen in earlier chapters of this book.

Much of the regular expression support in Visual Basic .NET can reasonably be termed standard. However, as with many Microsoft technologies, the standard syntax and techniques have been extended or modified in places.

The following table summarizes the metacharacters supported in Visual Basic .NET.

Metacharacter             Description

\d                        Matches a numeric digit.
\D                        Matches any character except a numeric digit.
\w                        Equivalent to the character class [A-Za-z0-9_] for ASCII input. By
                          default, Unicode letters and digits also match.
\W                        Matches any character that \w does not match. For ASCII input,
                          equivalent to the character class [^A-Za-z0-9_].
\b                        Matches the position at the beginning of a sequence of \w characters
                          or at the end of a sequence of \w characters. Colloquially, \b is
                          referred to as a word-boundary metacharacter.
\B                        Matches a position that is not a \b position.
\t                        Matches a tab character.
\n                        Matches a newline character.
\040                      Matches an ASCII character, expressed in octal notation. The
                          metacharacter \040 matches a space character.
\x20                      Matches an ASCII character, expressed in hexadecimal notation with
                          exactly two hexadecimal digits. The metacharacter \x20 matches a
                          space character.
\u0020                    Matches a Unicode character, expressed in hexadecimal notation
                          with exactly four hexadecimal digits. The metacharacter \u0020
                          matches a space character.
[...]                     Matches any character specified in the character class.
[^...]                    Matches any character but the characters specified in the character
                          class.
\s                        Matches a whitespace character.
\S                        Matches any character that is not a whitespace character.
^                         Depending on whether the MultiLine option is set, it matches the
                          position before the first character in a line or the position before
                          the first character in a string.
$                         Depending on whether the MultiLine option is set, it matches the
                          position after the last character in a line or the position after the
                          last character in a string.
$number                   In a replacement string, substitutes the character sequence matched
                          by the last occurrence of group number number.
${name}                   In a replacement string, substitutes the character sequence matched
                          by the last occurrence of the group named name.
\A                        Matches the position before the first character in a string. Its
                          behavior is not affected by the setting of the MultiLine option.
\Z                        Matches the position after the last character in a string, or before
                          a newline that is the final character of the string. Its behavior is
                          not affected by the setting of the MultiLine option.
\G                        Specifies that matches must be consecutive, without any intervening
                          nonmatching characters.
?                         A quantifier. Matches when there is zero or one occurrence of the
                          preceding character or group.
*                         A quantifier. Matches when there are zero or more occurrences of the
                          preceding character or group.
+                         A quantifier. Matches when there are one or more occurrences of the
                          preceding character or group.
{n}                       A quantifier. Matches when there are exactly n occurrences of the
                          preceding character or group.
{n,m}                     A quantifier. Matches when there are at least n occurrences and a
                          maximum of m occurrences of the preceding character or group.
(substring)               Captures the contained substring.
(?<name>substring)        Captures the contained substring and assigns it a name.
(?:substring)             A non-capturing group.
(?=...)                   A positive lookahead.
(?!...)                   A negative lookahead.
(?<=...)                  A positive lookbehind.
(?<!...)                  A negative lookbehind.
\N (where N is a number)  A back reference to a numbered group.
\k<name>                  A back reference to a named group (same meaning as \k'name').
\k'name'                  A back reference to a named group (same meaning as \k<name>).
|                         Alternation.
(?imnsx-imnsx)            An alternate technique to specify RegexOptions settings inline.
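Several entries from the table can be combined in one short pattern. The following sketch uses an inline option, word boundaries, a named group, a named back reference, and a named substitution to find and remove a doubled word. The module name, the variable names, and the sample sentence are assumptions made for illustration.

```vbnet
Imports System.Text.RegularExpressions

Module MetacharacterSketch
    Sub Main()
        ' (?i) turns on case-insensitive matching inline; \b marks word
        ' boundaries; (?<word>...) names a group; \k<word> refers back to it.
        Dim doubled As New Regex("(?i)\b(?<word>\w+)\s+\k<word>\b")

        ' Hypothetical input with an accidentally repeated word.
        Dim inputString As String = "This is is a test"

        Dim myMatch As Match = doubled.Match(inputString)
        Console.WriteLine(myMatch.Value)                    ' is is
        Console.WriteLine(myMatch.Groups("word").Value)     ' is

        ' ${word} in the replacement string substitutes the named group.
        Console.WriteLine(doubled.Replace(inputString, "${word}"))  ' This is a test
    End Sub
End Module
```

The leading \b keeps the pattern from matching the "is" inside "This", because there is no word boundary between the "h" and the "i".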

 

 

 
