- •Introduction
- •Who This Book Is For
- •What This Book Covers
- •How This Book Is Structured
- •What You Need to Use This Book
- •Conventions
- •Source Code
- •Errata
- •p2p.wrox.com
- •What Are Regular Expressions?
- •What Can Regular Expressions Be Used For?
- •Finding Doubled Words
- •Checking Input from Web Forms
- •Changing Date Formats
- •Finding Incorrect Case
- •Adding Links to URLs
- •Regular Expressions You Already Use
- •Search and Replace in Word Processors
- •Directory Listings
- •Online Searching
- •Why Regular Expressions Seem Intimidating
- •Compact, Cryptic Syntax
- •Whitespace Can Significantly Alter the Meaning
- •No Standards Body
- •Differences between Implementations
- •Characters Change Meaning in Different Contexts
- •Regular Expressions Can Be Case Sensitive
- •Case-Sensitive and Case-Insensitive Matching
- •Case and Metacharacters
- •Continual Evolution in Techniques Supported
- •Multiple Solutions for a Single Problem
- •What You Want to Do with a Regular Expression
- •Replacing Text in Quantity
- •Regular Expression Tools
- •findstr
- •Microsoft Word
- •StarOffice Writer/OpenOffice.org Writer
- •Komodo Rx Package
- •PowerGrep
- •Microsoft Excel
- •JavaScript and JScript
- •VBScript
- •Visual Basic.NET
- •Java
- •Perl
- •MySQL
- •SQL Server 2000
- •W3C XML Schema
- •An Analytical Approach to Using Regular Expressions
- •Express and Document What You Want to Do in English
- •Consider the Regular Expression Options Available
- •Consider Sensitivity and Specificity
- •Create Appropriate Regular Expressions
- •Document All but Simple Regular Expressions
- •Document What You Expect the Regular Expression to Do
- •Document What You Want to Match
- •Test the Results of a Regular Expression
- •Matching Single Characters
- •Matching Sequences of Characters That Each Occur Once
- •Introducing Metacharacters
- •Matching Sequences of Different Characters
- •Matching Optional Characters
- •Matching Multiple Optional Characters
- •Other Cardinality Operators
- •The * Quantifier
- •The + Quantifier
- •The Curly-Brace Syntax
- •The {n} Syntax
- •The {n,m} Syntax
- •Exercises
- •Regular Expression Metacharacters
- •Thinking about Characters and Positions
- •The Period (.) Metacharacter
- •Matching Variably Structured Part Numbers
- •Matching a Literal Period
- •The \w Metacharacter
- •The \W Metacharacter
- •Digits and Nondigits
- •The \d Metacharacter
- •Canadian Postal Code Example
- •The \D Metacharacter
- •Alternatives to \d and \D
- •The \s Metacharacter
- •Handling Optional Whitespace
- •The \S Metacharacter
- •The \t Metacharacter
- •The \n Metacharacter
- •Escaped Characters
- •Finding the Backslash
- •Modifiers
- •Global Search
- •Case-Insensitive Search
- •Exercises
- •Introduction to Character Classes
- •Choice between Two Characters
- •Using Quantifiers with Character Classes
- •Using the \b Metacharacter in Character Classes
- •Selecting Literal Square Brackets
- •Using Ranges in Character Classes
- •Alphabetic Ranges
- •Use [A-z] With Care
- •Digit Ranges in Character Classes
- •Hexadecimal Numbers
- •IP Addresses
- •Reverse Ranges in Character Classes
- •A Potential Range Trap
- •Finding HTML Heading Elements
- •Metacharacter Meaning within Character Classes
- •The ^ metacharacter
- •How to Use the - Metacharacter
- •Negated Character Classes
- •Combining Positive and Negative Character Classes
- •POSIX Character Classes
- •The [:alnum:] Character Class
- •Exercises
- •String, Line, and Word Boundaries
- •The ^ Metacharacter
- •The ^ Metacharacter and Multiline Mode
- •The $ Metacharacter
- •The $ Metacharacter in Multiline Mode
- •Using the ^ and $ Metacharacters Together
- •Matching Blank Lines
- •Working with Dollar Amounts
- •Revisiting the IP Address Example
- •What Is a Word?
- •Identifying Word Boundaries
- •The \< Syntax
- •The \> Syntax
- •The \b Syntax
- •The \B Metacharacter
- •Less-Common Word-Boundary Metacharacters
- •Exercises
- •Grouping Using Parentheses
- •Parentheses and Quantifiers
- •Matching Literal Parentheses
- •U.S. Telephone Number Example
- •Alternation
- •Choosing among Multiple Options
- •Unexpected Alternation Behavior
- •Capturing Parentheses
- •Numbering of Captured Groups
- •Numbering When Using Nested Parentheses
- •Named Groups
- •Non-Capturing Parentheses
- •Back References
- •Exercises
- •Why You Need Lookahead and Lookbehind
- •The (? metacharacters
- •Lookahead
- •Positive Lookahead
- •Negative Lookahead
- •Positive Lookahead Examples
- •Positive Lookahead in the Same Document
- •Inserting an Apostrophe
- •Lookbehind
- •Positive Lookbehind
- •Negative Lookbehind
- •How to Match Positions
- •Adding Commas to Large Numbers
- •Exercises
- •What Are Sensitivity and Specificity?
- •Extreme Sensitivity, Awful Specificity
- •Email Addresses Example
- •Replacing Hyphens Example
- •The Sensitivity/Specificity Trade-Off
- •Sensitivity, Specificity, and Positional Characters
- •Sensitivity, Specificity, and Modes
- •Sensitivity, Specificity, and Lookahead and Lookbehind
- •How Much Should the Regular Expressions Do?
- •Abbreviations
- •Characters from Other Languages
- •Names
- •Sensitivity and How to Achieve It
- •Specificity and How to Maximize It
- •Exercises
- •Documenting Regular Expressions
- •Document the Problem Definition
- •Add Comments to Your Code
- •Making Use of Extended Mode
- •Know Your Data
- •Abbreviations
- •Proper Names
- •Incorrect Spelling
- •Creating Test Cases
- •Debugging Regular Expressions
- •Treacherous Whitespace
- •Backslashes Causing Problems
- •Considering Other Causes
- •The User Interface
- •Metacharacters Available
- •Quantifiers
- •The @ Quantifier
- •The {n,m} Syntax
- •Modes
- •Character Classes
- •Back References
- •Lookahead and Lookbehind
- •Lazy Matching versus Greedy Matching
- •Examples
- •Character Class Examples, Including Ranges
- •Whole Word Searches
- •Search-and-Replace Examples
- •Changing Name Structure Using Back References
- •Manipulating Dates
- •The Star Training Company Example
- •Regular Expressions in Visual Basic for Applications
- •Exercises
- •The User Interface
- •Metacharacters Available
- •Quantifiers
- •Modes
- •Character Classes
- •Alternation
- •Back References
- •Lookahead and Lookbehind
- •Search Example
- •Search-and-Replace Example
- •Online Chats
- •POSIX Character Classes
- •Matching Numeric Digits
- •Exercises
- •Introducing findstr
- •Finding Literal Text
- •Quantifiers
- •Character Classes
- •Command-Line Switch Examples
- •The /v Switch
- •The /a Switch
- •Single File Examples
- •Simple Character Class Example
- •Find Protocols Example
- •Multiple File Example
- •A Filelist Example
- •Exercises
- •The PowerGREP Interface
- •A Simple Find Example
- •The Replace Tab
- •The File Finder Tab
- •Syntax Coloring
- •Other Tabs
- •Numeric Digits and Alphabetic Characters
- •Quantifiers
- •Back References
- •Alternation
- •Line Position Metacharacters
- •Word-Boundary Metacharacters
- •Lookahead and Lookbehind
- •Longer Examples
- •Finding HTML Horizontal Rule Elements
- •Matching Time Example
- •Exercises
- •The Excel Find Interface
- •Escaping Wildcard Characters
- •Using Wildcards in Data Forms
- •Using Wildcards in Filters
- •Exercises
- •Using LIKE with Regular Expressions
- •The % Metacharacter
- •The _ Metacharacter
- •Character Classes
- •Negated Character Classes
- •Using Full-Text Search
- •Using The CONTAINS Predicate
- •Document Filters on Image Columns
- •Exercises
- •Using the _ and % Metacharacters
- •Testing Matching of Literals: _ and % Metacharacters
- •Using Positional Metacharacters
- •Using Character Classes
- •Quantifiers
- •Social Security Number Example
- •Exercises
- •The Interface to Metacharacters in Microsoft Access
- •Creating a Hard-Wired Query
- •Creating a Parameter Query
- •Using the ? Metacharacter
- •Using the * Metacharacter
- •Using the # Metacharacter
- •Using the # Character with Date/Time Data
- •Using Character Classes in Access
- •Exercises
- •The RegExp Object
- •Attributes of the RegExp Object
- •The Other Properties of the RegExp Object
- •The test() Method of the RegExp Object
- •The exec() Method of the RegExp Object
- •The String Object
- •Metacharacters in JavaScript and JScript
- •SSN Validation Example
- •Exercises
- •The RegExp Object and How to Use It
- •Quantifiers
- •Positional Metacharacters
- •Character Classes
- •Word Boundaries
- •Lookahead
- •Grouping and Nongrouping Parentheses
- •Exercises
- •The System.Text.RegularExpressions namespace
- •A Simple Visual Basic .NET Example
- •The Classes of System.Text.RegularExpressions
- •The Regex Object
- •Using the Match Object and Matches Collection
- •Using the Match.Success Property and Match.NextMatch Method
- •The GroupCollection and Group Classes
- •The CaptureCollection and Capture Class
- •The RegexOptions Enumeration
- •Case-Insensitive Matching: The IgnoreCase Option
- •Multiline Matching: The Effect on the ^ and $ Metacharacters
- •Right to Left Matching: The RightToLeft Option
- •Lookahead and Lookbehind
- •Exercises
- •An Introductory Example
- •The Classes of System.Text.RegularExpressions
- •The Regex Class
- •The Options Property of the Regex Class
- •Regex Class Methods
- •The CompileToAssembly() Method
- •The GetGroupNames() Method
- •The GetGroupNumbers() Method
- •GroupNumberFromName() and GroupNameFromNumber() Methods
- •The IsMatch() Method
- •The Match() Method
- •The Matches() Method
- •The Replace() Method
- •The Split() Method
- •Using the Static Methods of the Regex Class
- •The IsMatch() Method as a Static
- •The Match() Method as a Static
- •The Matches() Method as a Static
- •The Replace() Method as a Static
- •The Split() Method as a Static
- •The Match and Matches Classes
- •The Match Class
- •The GroupCollection and Group Classes
- •The RegexOptions Class
- •The IgnorePatternWhitespace Option
- •Metacharacters Supported in Visual C# .NET
- •Using Named Groups
- •Using Back References
- •Exercise
- •The ereg() Set of Functions
- •The ereg() Function
- •The ereg() Function with Three Arguments
- •The eregi() Function
- •The ereg_replace() Function
- •The eregi_replace() Function
- •The split() Function
- •The spliti() Function
- •The sql_regcase() Function
- •Perl Compatible Regular Expressions
- •Pattern Delimiters in PCRE
- •Escaping Pattern Delimiters
- •Matching Modifiers in PCRE
- •Using the preg_match() Function
- •Using the preg_match_all() Function
- •Using the preg_grep() Function
- •Using the preg_quote() Function
- •Using the preg_replace() Function
- •Using the preg_replace_callback() Function
- •Using the preg_split() Function
- •Supported Metacharacters with ereg()
- •Using POSIX Character Classes with PHP
- •Supported Metacharacters with PCRE
- •Positional Metacharacters
- •Character Classes in PHP
- •Documenting PHP Regular Expressions
- •Exercises
- •W3C XML Schema Basics
- •Tools for Using W3C XML Schema
- •Comparing XML Schema and DTDs
- •How Constraints Are Expressed in W3C XML Schema
- •W3C XML Schema Datatypes
- •Derivation by Restriction
- •Unicode and W3C XML Schema
- •Unicode Overview
- •Using Unicode Character Classes
- •Matching Decimal Numbers
- •Mixing Unicode Character Classes with Other Metacharacters
- •Unicode Character Blocks
- •Using Unicode Character Blocks
- •Metacharacters Supported in W3C XML Schema
- •Positional Metacharacters
- •Matching Numeric Digits
- •Alternation
- •Using the \w and \s Metacharacters
- •Escaping Metacharacters
- •Exercises
- •Introduction to the java.util.regex Package
- •Obtaining and Installing Java
- •The Pattern Class
- •Using the matches() Method Statically
- •Two Simple Java Examples
- •The Properties (Fields) of the Pattern Class
- •The CASE_INSENSITIVE Flag
- •Using the COMMENTS Flag
- •The DOTALL Flag
- •The MULTILINE Flag
- •The UNICODE_CASE Flag
- •The UNIX_LINES Flag
- •The Methods of the Pattern Class
- •The compile() Method
- •The flags() Method
- •The matcher() Method
- •The matches() Method
- •The pattern() Method
- •The split() Method
- •The Matcher Class
- •The appendReplacement() Method
- •The appendTail() Method
- •The end() Method
- •The find() Method
- •The group() Method
- •The groupCount() Method
- •The lookingAt() Method
- •The matches() Method
- •The pattern() Method
- •The replaceAll() Method
- •The replaceFirst() Method
- •The reset() Method
- •The start() Method
- •The PatternSyntaxException Class
- •Using the \d Metacharacter
- •Character Classes
- •The POSIX Character Classes in the java.util.regex Package
- •Unicode Character Classes and Character Blocks
- •Using Escaped Characters
- •Using Methods of the String Class
- •Using the matches() Method
- •Using the replaceFirst() Method
- •Using the replaceAll() Method
- •Using the split() Method
- •Exercises
- •Obtaining and Installing Perl
- •Creating a Simple Perl Program
- •Basics of Perl Regular Expression Usage
- •Using the m// Operator
- •Using Other Regular Expression Delimiters
- •Matching Using Variable Substitution
- •Using the s/// Operator
- •Using s/// with the Global Modifier
- •Using s/// with the Default Variable
- •Using the split Operator
- •Using Quantifiers in Perl
- •Using Positional Metacharacters
- •Captured Groups in Perl
- •Using Back References in Perl
- •Using Alternation
- •Using Character Classes in Perl
- •Using Lookahead
- •Using Lookbehind
- •Escaping Metacharacters
- •A Simple Perl Regex Tester
- •Exercises
- •Index
Chapter 21
Lookahead and Lookbehind
Visual Basic .NET supports all four options: positive lookahead, negative lookahead, positive lookbehind, and negative lookbehind.
Positive lookahead uses the (?=theLookahead) syntax. To match the word Star when followed by a space character and the character sequence Training, you could use the following code:
Dim myRegex As New Regex("Star(?= Training)")
Dim myMatch As Match = myRegex.Match("The Star Training Company carries out great training.")
Negative lookahead uses the (?!theLookahead) syntax. To match the character sequence Star when it is not followed by a space character and the character sequence Training, you could use the following code:
Dim myRegex As New Regex("Star(?! Training)")
Dim myMatch As Match = myRegex.Match("The Star Training Company carries out great training.")
In this test string the only occurrence of Star is followed by a space character and Training, so no match is found and myMatch.Success is False.
Positive lookbehind uses the (?<=theLookbehind) syntax. To match the character sequence Training when it is preceded by the character sequence Star followed by a space character, you could use the following code:
Dim myRegex As New Regex("(?<=Star )Training")
Dim myMatch As Match = myRegex.Match("The Star Training Company carries out great training.")
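Negative lookbehind uses the (?<!theLookbehind) syntax. To match the character sequence Training when it is not preceded by the character sequence Star followed by a space character, you could use the following code:
Dim myRegex As New Regex("(?<!Star )Training")
Dim myMatch As Match = myRegex.Match("The Star Training Company carries out great training.")
In this test string the only capitalized occurrence of Training is preceded by Star and a space character, and matching is case sensitive by default, so the lowercase training at the end of the sentence does not match either. As a result, myMatch.Success is False.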
Exercises
1. Specify a pattern that will match the character sequence old only when it is part of a word such as cold or bold. Hint: Provide two solutions, one of which uses lookbehind and lookahead.
2. Create a console application that replaces the character sequence Doctor or Doc with the character sequence Dr..
Chapter 22
C# and Regular Expressions
Microsoft Visual C# .NET provides extensive, powerful, and flexible support for regular expression functionality. Its support is comparable to that of Perl version 5, plus some extensions that are largely specific to the .NET Framework (for example, right-to-left matching). Regular expression implementations are playing an ongoing game of catch-up with one another, and it is likely that at least some other languages will implement features such as right-to-left matching in time.
In this chapter, you will learn how to do the following:
Use the objects contained in the System.Text.RegularExpressions namespace
Use the metacharacters supported in C#
Examples shown in this chapter have been tested with Visual Studio 2003 and the .NET Framework 1.1. I will assume that you have access to a copy of Visual Studio 2003 and have a working knowledge of at least the basics of Visual C# .NET. It isn’t the intent of this chapter to provide a tutorial on the basics of using Visual Studio .NET 2003.
However, if you do not have access to a copy of Visual Studio 2003, you can run the compiled .exe files instead, although you won’t be able to view and edit the Visual C# .NET code if you use the .exe files.
The regular expression functionality in C# is based on the classes in the System.Text.RegularExpressions namespace. Those classes will be explained in some detail, including several examples of how the classes, their properties, and their methods can be used in code.
The Classes of the System.Text.RegularExpressions Namespace
The regular expressions support in the .NET Framework class library is contained in the System.Text.RegularExpressions namespace.
An Introductory Example
This example demonstrates the basics of one way to use regular expressions when using Visual C#. Other techniques are discussed and demonstrated later in the chapter, when the classes of the System.Text.RegularExpressions namespace and their members are discussed in more detail.
Try It Out: An Introductory C# Console Application Example
The following code is contained in Class1.cs in the SimpleMatch project:
using System;
using System.Text.RegularExpressions;

namespace SimpleMatch
{
    /// <summary>
    /// This is a simple regular expression example which uses the Regex object.
    /// </summary>
    class Class1
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread]
        static void Main(string[] args)
        {
            Console.WriteLine(@"This will find a match for the regular expression '[A-Z]\d'.");
            Console.WriteLine("Enter a test string now.");
            Regex myRegex = new Regex(@"[A-Z]\d", RegexOptions.IgnoreCase);
            string inputString;
            inputString = Console.ReadLine();
            Match myMatch = myRegex.Match(inputString);
            Console.WriteLine("You entered the string: '" + inputString + "'.");
            if (myMatch.Success)
                Console.WriteLine("The match '" + myMatch.ToString() + "' was found in the string you entered.");
            Console.ReadLine();
        }
    }
}
The following instructions walk you through all the steps necessary to create a simple console application in Visual Studio 2003 using Visual C# .NET. If you have done much programming in C#, you will find most of the steps pretty self-evident.
1. Open Visual Studio 2003, and from the File menu, select New; then select Project to create a new solution that contains a single project.
Figure 22-1 shows the appearance of the Project screen, but with the choices specified in Steps 2 through 5 already made.
Figure 22-1
2. In the Project Types pane, select Visual C# Projects.
3. In the Templates pane, select Console Application.
4. In the Name text box, type SimpleMatch as the name of the project.
5. In the Location text box, type C:\BRegExp\Ch22 as the location (or select another location, if you prefer).
6. Click the OK button. After a short pause while Visual Studio 2003 creates the files needed for the project, the code editor will open with the following template code already in place:
using System;

namespace SimpleMatch
{
    /// <summary>
    /// Summary description for Class1.
    /// </summary>
    class Class1
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread]
        static void Main(string[] args)
        {
            //
            // TODO: Add code to start application here
            //
        }
    }
}
7. Edit the preceding template code so that it contains the code shown earlier in Class1.cs.
8. Save the code using Ctrl+Shift+S. Press F5 to run the code.
9. At the command line, enter the test text K9. Then press Return. Inspect the results, as shown in Figure 22-2.
Figure 22-2
How It Works
When using C# you must specify the components of the .NET Framework class library that you are using. Visual Studio 2003 automatically adds the following line when the file Class1.cs is created:
using System;
And because you are using the Regex class from the System.Text.RegularExpressions namespace, it is appropriate to add a using statement referencing that namespace, too:
using System.Text.RegularExpressions;
The alternative approach is to use fully qualified names when referring to an object. For example, with the using System.Text.RegularExpressions; statement in the code, you can simply write the following to declare the myRegex object variable and assign it a value:
Regex myRegex = new Regex(@"[A-Z]\d", RegexOptions.IgnoreCase);
If the using System.Text.RegularExpressions; statement is missing and you attempt to run the code, you will receive several error messages, including the following. The errors occur because the Regex class is not found in the System namespace, which is the only namespace declared by the default template code that Visual Studio 2003 creates:
The type or namespace name 'Regex' could not be found.
So to declare the myRegex variable and assign it a value, you would have to write the following code, using fully qualified names, because the Regex and RegexOptions classes are contained in the System.Text.RegularExpressions namespace:
System.Text.RegularExpressions.Regex myRegex = new
    System.Text.RegularExpressions.Regex(@"[A-Z]\d",
    System.Text.RegularExpressions.RegexOptions.IgnoreCase);
Similarly, it would be necessary to write the following to declare the myMatch object variable and assign it a value:
System.Text.RegularExpressions.Match myMatch = myRegex.Match(inputString);
In all but the most trivial code, it is easier to write and read code when the using System.Text.RegularExpressions; statement is present.
Class1.cs contains automatically generated stubs for documentation comments, an automatically generated namespace whose name corresponds to the project name, and a class that is named Class1 by default.
The content of the Main() method is where the work of this simple example is carried out:
static void Main(string[] args)
{
First, a message is written to the command window using the Console class’s WriteLine() method. The Console class is a member of the System namespace, which has already been referenced using the using System; statement, so you can simply write Console.WriteLine() with appropriate content between the parentheses:
Console.WriteLine(@"This will find a match for the regular expression '[A-Z]\d'.");
Notice that the first character inside the parentheses of the WriteLine() method is an @ character, which marks the string as a verbatim string literal. Without it, the compiler would report an error, because \d is not an escape sequence that C# recognizes in an ordinary string literal. In the absence of the @ character, you would have to write the string in the double quotes as "This will find a match for the regular expression '[A-Z]\\d'.". In other words, you must write \\d for C# to recognize this as meaning the regular expression metacharacter \d.
Personally, I prefer adding the @ character, because I can then use the familiar regular expression syntax that I use in other languages. Because I use regular expressions across various languages and tools, I tend to avoid the double-backslash notation.
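As a minimal sketch of the point about the @ character, the following console program compares the two ways of writing the pattern. The class name VerbatimStringSketch and its two constants are hypothetical names chosen for illustration; they are not part of the SimpleMatch project.

```csharp
using System;
using System.Text.RegularExpressions;

public class VerbatimStringSketch
{
    // Verbatim literal: the backslash is taken literally.
    public const string Verbatim = @"[A-Z]\d";

    // Ordinary literal: the backslash has to be doubled.
    public const string Escaped = "[A-Z]\\d";

    public static void Main()
    {
        // Both literals produce the same seven-character string.
        Console.WriteLine(Verbatim == Escaped);            // True
        Console.WriteLine(Verbatim.Length);                // 7

        // Either form therefore defines the same regular expression.
        Console.WriteLine(Regex.IsMatch("K9", Escaped));   // True
    }
}
```

Which form you choose is a matter of readability only; the Regex class receives identical text in both cases.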
Next, a straightforward information string is output:
Console.WriteLine("Enter a test string now.");
Next, a variable, myRegex, of type Regex is declared and assigned a new Regex object. As explained earlier, writing Regex is a convenient abbreviation for the fully qualified name System.Text.RegularExpressions.Regex. The regular expression pattern [A-Z]\d is the first argument of the Regex() constructor and specifies the pattern against which matching will take place. The second argument of the Regex() constructor specifies that case-insensitive matching is to be used:
Regex myRegex = new Regex(@"[A-Z]\d", RegexOptions.IgnoreCase);
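To see what the second constructor argument changes, the following sketch runs the same pattern with and without RegexOptions.IgnoreCase. The class IgnoreCaseSketch and its Matches() helper are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public class IgnoreCaseSketch
{
    // Returns true when the pattern [A-Z]\d matches somewhere in input,
    // using whichever options the caller supplies.
    public static bool Matches(string input, RegexOptions options)
    {
        Regex regex = new Regex(@"[A-Z]\d", options);
        return regex.IsMatch(input);
    }

    public static void Main()
    {
        // Without IgnoreCase, [A-Z] matches uppercase letters only.
        Console.WriteLine(Matches("k9", RegexOptions.None));        // False
        // With IgnoreCase, the lowercase k is accepted as well.
        Console.WriteLine(Matches("k9", RegexOptions.IgnoreCase));  // True
        Console.WriteLine(Matches("K9", RegexOptions.None));        // True
    }
}
```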
Next, a string variable, inputString, is declared:
string inputString;
The Console class’s ReadLine() method is used to read the text entered by the user. The value read is assigned to the inputString variable:
inputString = Console.ReadLine();
The variable myMatch is declared as being of type Match. The value assigned to it is returned by the Regex class’s Match() method, called on myRegex with the inputString variable as its argument. In other words, myMatch holds the first match, if any, found in inputString using the regular expression pattern [A-Z]\d that was assigned earlier to the myRegex variable:
Match myMatch = myRegex.Match(inputString);
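The Match() method returns only the first match, even when the input contains several candidates. The sketch below illustrates this with a fixed test string in place of user input. FirstMatchSketch, its First() helper, and the test string are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public class FirstMatchSketch
{
    // Returns the first match of [A-Z]\d (case-insensitive) in input.
    public static Match First(string input)
    {
        Regex regex = new Regex(@"[A-Z]\d", RegexOptions.IgnoreCase);
        return regex.Match(input);
    }

    public static void Main()
    {
        // Both "b2" and "C3" fit the pattern, but only the first is returned.
        Match m = First("Rooms b2 and C3");
        Console.WriteLine(m.Value);   // b2
        Console.WriteLine(m.Index);   // 6
    }
}
```

The Value and Index properties of the Match object give the matched text and its zero-based position in the input string.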
Now that you have a Match object, you first output the value of the inputString variable to remind or inform the user of the string that was read using the Console.ReadLine() method:
Console.WriteLine("You entered the string: '" + inputString + "'.");
An if statement is used that tests the value of the Success property of the myMatch object variable. If a match has been found (as indicated by the value of the Success property), a string is output using Console.WriteLine() to inform the user of the content of the match:
if (myMatch.Success)
    Console.WriteLine("The match '" + myMatch.ToString() + "' was found in the string you entered.");
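The example prints nothing when the input contains no match. One possible variation, sketched here under the assumption that you also want to report the failure, adds a second message for the case in which Success is false. SuccessSketch, its Describe() helper, and the wording of the messages are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public class SuccessSketch
{
    // Builds a message describing the outcome of matching [A-Z]\d
    // (case-insensitive) against input.
    public static string Describe(string input)
    {
        Match m = Regex.Match(input, @"[A-Z]\d", RegexOptions.IgnoreCase);
        if (m.Success)
            return "The match '" + m.Value + "' was found.";
        // When matching fails, Success is false and Value is an empty string.
        return "No match was found.";
    }

    public static void Main()
    {
        Console.WriteLine(Describe("K9"));      // The match 'K9' was found.
        Console.WriteLine(Describe("hello"));   // No match was found.
    }
}
```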
The ReadLine() method is used so that the displayed match remains on-screen until the user presses the Return key:
Console.ReadLine();
}