- •Introduction
- •Who This Book Is For
- •What This Book Covers
- •How This Book Is Structured
- •What You Need to Use This Book
- •Conventions
- •Source Code
- •Errata
- •p2p.wrox.com
- •What Are Regular Expressions?
- •What Can Regular Expressions Be Used For?
- •Finding Doubled Words
- •Checking Input from Web Forms
- •Changing Date Formats
- •Finding Incorrect Case
- •Adding Links to URLs
- •Regular Expressions You Already Use
- •Search and Replace in Word Processors
- •Directory Listings
- •Online Searching
- •Why Regular Expressions Seem Intimidating
- •Compact, Cryptic Syntax
- •Whitespace Can Significantly Alter the Meaning
- •No Standards Body
- •Differences between Implementations
- •Characters Change Meaning in Different Contexts
- •Regular Expressions Can Be Case Sensitive
- •Case-Sensitive and Case-Insensitive Matching
- •Case and Metacharacters
- •Continual Evolution in Techniques Supported
- •Multiple Solutions for a Single Problem
- •What You Want to Do with a Regular Expression
- •Replacing Text in Quantity
- •Regular Expression Tools
- •findstr
- •Microsoft Word
- •StarOffice Writer/OpenOffice.org Writer
- •Komodo Rx Package
- •PowerGrep
- •Microsoft Excel
- •JavaScript and JScript
- •VBScript
- •Visual Basic.NET
- •Java
- •Perl
- •MySQL
- •SQL Server 2000
- •W3C XML Schema
- •An Analytical Approach to Using Regular Expressions
- •Express and Document What You Want to Do in English
- •Consider the Regular Expression Options Available
- •Consider Sensitivity and Specificity
- •Create Appropriate Regular Expressions
- •Document All but Simple Regular Expressions
- •Document What You Expect the Regular Expression to Do
- •Document What You Want to Match
- •Test the Results of a Regular Expression
- •Matching Single Characters
- •Matching Sequences of Characters That Each Occur Once
- •Introducing Metacharacters
- •Matching Sequences of Different Characters
- •Matching Optional Characters
- •Matching Multiple Optional Characters
- •Other Cardinality Operators
- •The * Quantifier
- •The + Quantifier
- •The Curly-Brace Syntax
- •The {n} Syntax
- •The {n,m} Syntax
- •Exercises
- •Regular Expression Metacharacters
- •Thinking about Characters and Positions
- •The Period (.) Metacharacter
- •Matching Variably Structured Part Numbers
- •Matching a Literal Period
- •The \w Metacharacter
- •The \W Metacharacter
- •Digits and Nondigits
- •The \d Metacharacter
- •Canadian Postal Code Example
- •The \D Metacharacter
- •Alternatives to \d and \D
- •The \s Metacharacter
- •Handling Optional Whitespace
- •The \S Metacharacter
- •The \t Metacharacter
- •The \n Metacharacter
- •Escaped Characters
- •Finding the Backslash
- •Modifiers
- •Global Search
- •Case-Insensitive Search
- •Exercises
- •Introduction to Character Classes
- •Choice between Two Characters
- •Using Quantifiers with Character Classes
- •Using the \b Metacharacter in Character Classes
- •Selecting Literal Square Brackets
- •Using Ranges in Character Classes
- •Alphabetic Ranges
- •Use [A-z] With Care
- •Digit Ranges in Character Classes
- •Hexadecimal Numbers
- •IP Addresses
- •Reverse Ranges in Character Classes
- •A Potential Range Trap
- •Finding HTML Heading Elements
- •Metacharacter Meaning within Character Classes
- •The ^ metacharacter
- •How to Use the - Metacharacter
- •Negated Character Classes
- •Combining Positive and Negative Character Classes
- •POSIX Character Classes
- •The [:alnum:] Character Class
- •Exercises
- •String, Line, and Word Boundaries
- •The ^ Metacharacter
- •The ^ Metacharacter and Multiline Mode
- •The $ Metacharacter
- •The $ Metacharacter in Multiline Mode
- •Using the ^ and $ Metacharacters Together
- •Matching Blank Lines
- •Working with Dollar Amounts
- •Revisiting the IP Address Example
- •What Is a Word?
- •Identifying Word Boundaries
- •The \< Syntax
- •The \> Syntax
- •The \b Syntax
- •The \B Metacharacter
- •Less-Common Word-Boundary Metacharacters
- •Exercises
- •Grouping Using Parentheses
- •Parentheses and Quantifiers
- •Matching Literal Parentheses
- •U.S. Telephone Number Example
- •Alternation
- •Choosing among Multiple Options
- •Unexpected Alternation Behavior
- •Capturing Parentheses
- •Numbering of Captured Groups
- •Numbering When Using Nested Parentheses
- •Named Groups
- •Non-Capturing Parentheses
- •Back References
- •Exercises
- •Why You Need Lookahead and Lookbehind
- •The (? metacharacters
- •Lookahead
- •Positive Lookahead
- •Negative Lookahead
- •Positive Lookahead Examples
- •Positive Lookahead in the Same Document
- •Inserting an Apostrophe
- •Lookbehind
- •Positive Lookbehind
- •Negative Lookbehind
- •How to Match Positions
- •Adding Commas to Large Numbers
- •Exercises
- •What Are Sensitivity and Specificity?
- •Extreme Sensitivity, Awful Specificity
- •Email Addresses Example
- •Replacing Hyphens Example
- •The Sensitivity/Specificity Trade-Off
- •Sensitivity, Specificity, and Positional Characters
- •Sensitivity, Specificity, and Modes
- •Sensitivity, Specificity, and Lookahead and Lookbehind
- •How Much Should the Regular Expressions Do?
- •Abbreviations
- •Characters from Other Languages
- •Names
- •Sensitivity and How to Achieve It
- •Specificity and How to Maximize It
- •Exercises
- •Documenting Regular Expressions
- •Document the Problem Definition
- •Add Comments to Your Code
- •Making Use of Extended Mode
- •Know Your Data
- •Abbreviations
- •Proper Names
- •Incorrect Spelling
- •Creating Test Cases
- •Debugging Regular Expressions
- •Treacherous Whitespace
- •Backslashes Causing Problems
- •Considering Other Causes
- •The User Interface
- •Metacharacters Available
- •Quantifiers
- •The @ Quantifier
- •The {n,m} Syntax
- •Modes
- •Character Classes
- •Back References
- •Lookahead and Lookbehind
- •Lazy Matching versus Greedy Matching
- •Examples
- •Character Class Examples, Including Ranges
- •Whole Word Searches
- •Search-and-Replace Examples
- •Changing Name Structure Using Back References
- •Manipulating Dates
- •The Star Training Company Example
- •Regular Expressions in Visual Basic for Applications
- •Exercises
- •The User Interface
- •Metacharacters Available
- •Quantifiers
- •Modes
- •Character Classes
- •Alternation
- •Back References
- •Lookahead and Lookbehind
- •Search Example
- •Search-and-Replace Example
- •Online Chats
- •POSIX Character Classes
- •Matching Numeric Digits
- •Exercises
- •Introducing findstr
- •Finding Literal Text
- •Quantifiers
- •Character Classes
- •Command-Line Switch Examples
- •The /v Switch
- •The /a Switch
- •Single File Examples
- •Simple Character Class Example
- •Find Protocols Example
- •Multiple File Example
- •A Filelist Example
- •Exercises
- •The PowerGREP Interface
- •A Simple Find Example
- •The Replace Tab
- •The File Finder Tab
- •Syntax Coloring
- •Other Tabs
- •Numeric Digits and Alphabetic Characters
- •Quantifiers
- •Back References
- •Alternation
- •Line Position Metacharacters
- •Word-Boundary Metacharacters
- •Lookahead and Lookbehind
- •Longer Examples
- •Finding HTML Horizontal Rule Elements
- •Matching Time Example
- •Exercises
- •The Excel Find Interface
- •Escaping Wildcard Characters
- •Using Wildcards in Data Forms
- •Using Wildcards in Filters
- •Exercises
- •Using LIKE with Regular Expressions
- •The % Metacharacter
- •The _ Metacharacter
- •Character Classes
- •Negated Character Classes
- •Using Full-Text Search
- •Using The CONTAINS Predicate
- •Document Filters on Image Columns
- •Exercises
- •Using the _ and % Metacharacters
- •Testing Matching of Literals: _ and % Metacharacters
- •Using Positional Metacharacters
- •Using Character Classes
- •Quantifiers
- •Social Security Number Example
- •Exercises
- •The Interface to Metacharacters in Microsoft Access
- •Creating a Hard-Wired Query
- •Creating a Parameter Query
- •Using the ? Metacharacter
- •Using the * Metacharacter
- •Using the # Metacharacter
- •Using the # Character with Date/Time Data
- •Using Character Classes in Access
- •Exercises
- •The RegExp Object
- •Attributes of the RegExp Object
- •The Other Properties of the RegExp Object
- •The test() Method of the RegExp Object
- •The exec() Method of the RegExp Object
- •The String Object
- •Metacharacters in JavaScript and JScript
- •SSN Validation Example
- •Exercises
- •The RegExp Object and How to Use It
- •Quantifiers
- •Positional Metacharacters
- •Character Classes
- •Word Boundaries
- •Lookahead
- •Grouping and Nongrouping Parentheses
- •Exercises
- •The System.Text.RegularExpressions namespace
- •A Simple Visual Basic .NET Example
- •The Classes of System.Text.RegularExpressions
- •The Regex Object
- •Using the Match Object and Matches Collection
- •Using the Match.Success Property and Match.NextMatch Method
- •The GroupCollection and Group Classes
- •The CaptureCollection and Capture Class
- •The RegexOptions Enumeration
- •Case-Insensitive Matching: The IgnoreCase Option
- •Multiline Matching: The Effect on the ^ and $ Metacharacters
- •Right to Left Matching: The RightToLeft Option
- •Lookahead and Lookbehind
- •Exercises
- •An Introductory Example
- •The Classes of System.Text.RegularExpressions
- •The Regex Class
- •The Options Property of the Regex Class
- •Regex Class Methods
- •The CompileToAssembly() Method
- •The GetGroupNames() Method
- •The GetGroupNumbers() Method
- •GroupNumberFromName() and GroupNameFromNumber() Methods
- •The IsMatch() Method
- •The Match() Method
- •The Matches() Method
- •The Replace() Method
- •The Split() Method
- •Using the Static Methods of the Regex Class
- •The IsMatch() Method as a Static
- •The Match() Method as a Static
- •The Matches() Method as a Static
- •The Replace() Method as a Static
- •The Split() Method as a Static
- •The Match and Matches Classes
- •The Match Class
- •The GroupCollection and Group Classes
- •The RegexOptions Class
- •The IgnorePatternWhitespace Option
- •Metacharacters Supported in Visual C# .NET
- •Using Named Groups
- •Using Back References
- •Exercise
- •The ereg() Set of Functions
- •The ereg() Function
- •The ereg() Function with Three Arguments
- •The eregi() Function
- •The ereg_replace() Function
- •The eregi_replace() Function
- •The split() Function
- •The spliti() Function
- •The sql_regcase() Function
- •Perl Compatible Regular Expressions
- •Pattern Delimiters in PCRE
- •Escaping Pattern Delimiters
- •Matching Modifiers in PCRE
- •Using the preg_match() Function
- •Using the preg_match_all() Function
- •Using the preg_grep() Function
- •Using the preg_quote() Function
- •Using the preg_replace() Function
- •Using the preg_replace_callback() Function
- •Using the preg_split() Function
- •Supported Metacharacters with ereg()
- •Using POSIX Character Classes with PHP
- •Supported Metacharacters with PCRE
- •Positional Metacharacters
- •Character Classes in PHP
- •Documenting PHP Regular Expressions
- •Exercises
- •W3C XML Schema Basics
- •Tools for Using W3C XML Schema
- •Comparing XML Schema and DTDs
- •How Constraints Are Expressed in W3C XML Schema
- •W3C XML Schema Datatypes
- •Derivation by Restriction
- •Unicode and W3C XML Schema
- •Unicode Overview
- •Using Unicode Character Classes
- •Matching Decimal Numbers
- •Mixing Unicode Character Classes with Other Metacharacters
- •Unicode Character Blocks
- •Using Unicode Character Blocks
- •Metacharacters Supported in W3C XML Schema
- •Positional Metacharacters
- •Matching Numeric Digits
- •Alternation
- •Using the \w and \s Metacharacters
- •Escaping Metacharacters
- •Exercises
- •Introduction to the java.util.regex Package
- •Obtaining and Installing Java
- •The Pattern Class
- •Using the matches() Method Statically
- •Two Simple Java Examples
- •The Properties (Fields) of the Pattern Class
- •The CASE_INSENSITIVE Flag
- •Using the COMMENTS Flag
- •The DOTALL Flag
- •The MULTILINE Flag
- •The UNICODE_CASE Flag
- •The UNIX_LINES Flag
- •The Methods of the Pattern Class
- •The compile() Method
- •The flags() Method
- •The matcher() Method
- •The matches() Method
- •The pattern() Method
- •The split() Method
- •The Matcher Class
- •The appendReplacement() Method
- •The appendTail() Method
- •The end() Method
- •The find() Method
- •The group() Method
- •The groupCount() Method
- •The lookingAt() Method
- •The matches() Method
- •The pattern() Method
- •The replaceAll() Method
- •The replaceFirst() Method
- •The reset() Method
- •The start() Method
- •The PatternSyntaxException Class
- •Using the \d Metacharacter
- •Character Classes
- •The POSIX Character Classes in the java.util.regex Package
- •Unicode Character Classes and Character Blocks
- •Using Escaped Characters
- •Using Methods of the String Class
- •Using the matches() Method
- •Using the replaceFirst() Method
- •Using the replaceAll() Method
- •Using the split() Method
- •Exercises
- •Obtaining and Installing Perl
- •Creating a Simple Perl Program
- •Basics of Perl Regular Expression Usage
- •Using the m// Operator
- •Using Other Regular Expression Delimiters
- •Matching Using Variable Substitution
- •Using the s/// Operator
- •Using s/// with the Global Modifier
- •Using s/// with the Default Variable
- •Using the split Operator
- •Using Quantifiers in Perl
- •Using Positional Metacharacters
- •Captured Groups in Perl
- •Using Back References in Perl
- •Using Alternation
- •Using Character Classes in Perl
- •Using Lookahead
- •Using Lookbehind
- •Escaping Metacharacters
- •A Simple Perl Regex Tester
- •Exercises
- •Index
Chapter 4
Regular Expression Metacharacters
You saw in Chapter 3 how literal characters can be combined with quantifiers to create useful but fairly simple regular expression patterns. However, literal characters are pretty restrictive in what they match. Sometimes, it is desirable or necessary to allow more flexible matching. Several metacharacters match a class of characters rather than simply a single literal character. That wider scope can be very useful.
Many of the metacharacters referred to and demonstrated in this chapter consist of two characters. The term metasequence is sometimes used to refer to such pairs of characters that, taken together, convey the meaning of a metacharacter. I use the terms metacharacter and metasequence interchangeably.
For example, consider a parts inventory, Inventory.txt, such as the following:
D99C44
A9DC55
CODD29
RT2C23
MNZC55
UVCC83
Notice how variably the first three characters of the sample part numbers are structured. The first part number has an alphabetic character followed by two numeric digits, whereas the second has a single alphabetic character, then a single numeric digit, then another alphabetic character. The techniques you have used previously won’t let you specify a suitable regular expression pattern, because the structure of a part number is too variable to address easily using literal characters alone. The task is to find matches that satisfy the following problem definition:
Match part numbers where the fourth character is an uppercase C and the fifth and sixth characters are numeric digits.
If the data is simple, with a relatively small number of options for any individual character, it might be possible to provide a solution using the alternation techniques described in Chapter 7. However, for the purposes of this chapter, assume that the data is so varied that other techniques should be used.
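As a rough illustration of the problem definition, the following minimal Python sketch filters the sample part numbers. It is not the approach used in this book's walkthroughs, which rely on the Komodo Regular Expression Toolkit; the use of Python's re module, the variable names, and the pattern ...C\d\d are all assumptions made for illustration. The period and \d metacharacters in the pattern are described later in this chapter.

```python
import re

# Sample part numbers from Inventory.txt, as listed above.
part_numbers = ["D99C44", "A9DC55", "CODD29", "RT2C23", "MNZC55", "UVCC83"]

# Any three characters, then an uppercase C, then two numeric digits.
# (Assumed pattern; one of several ways to express the problem definition.)
pattern = re.compile(r"...C\d\d")

# fullmatch() requires the whole part number to fit the pattern.
matching = [p for p in part_numbers if pattern.fullmatch(p)]
print(matching)
```

In this sketch, CODD29 is the only sample part number rejected, because its fourth character is D rather than C.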
Thinking about Characters and Positions
One of the important basic concepts that you need to grasp is the difference between a character and a position.
To make the distinction between a character and a position clear, look at the following sample text:
This is a simple sentence.
Metacharacters and Modifiers
The first character in the sample text is the uppercase T of This. However, there is a position immediately before the uppercase T. The position is not visible and does not match any of the literal characters discussed in Chapter 3. However, there are metacharacters that match a position, such as the ^ metacharacter, which matches the position immediately before the uppercase T in the sample text. Metacharacters that match positions rather than characters are introduced in detail in Chapter 6.
The second character in the sample text is the lowercase h of This. Between the initial uppercase T and the lowercase h, there is a position. Often, such positions between the letters of a sequence of characters (in other words, positions inside words) are not of specific interest to a developer. However, positions at the beginning of a string, at the end of a string, and at the beginning and end of a sequence of alphabetic characters are often of more interest to developers, which is why there are metacharacters that correspond to such positions. The so-called word-boundary metacharacters (strictly speaking, they match the boundaries of a sequence of alphabetic or alphanumeric characters) match a position between an alphabetic character and a nonalphabetic character. In many situations, those boundaries will correspond to the boundaries of a word. Those metacharacters are introduced in Chapter 6.
Metacharacters that match classes of characters are also very useful, and it is those that this chapter tackles.
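The difference between matching a character and matching a position can be made concrete with a short Python sketch. Python's re module and the variable names here are assumptions for illustration; the ^ and \b metacharacters are covered properly in Chapter 6. The point to notice is that a position match consumes no characters, so its start and end offsets are equal.

```python
import re

text = "This is a simple sentence."

# ^ matches the position before the first character: a zero-length match.
start = re.search(r"^", text)
print(start.span())        # start and end offsets are the same

# A literal T matches the character itself: a match one character long.
first_char = re.search(r"T", text)
print(first_char.span())   # end offset is one greater than start offset

# \b matches word-boundary positions; collect the offset of each one.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)
```

The list of boundary offsets contains no entry for the positions inside words, such as the one between the T and the h of This.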
The Period (.) Metacharacter
The period is one of the most broadly scoped metacharacters. It can match any alphabetic character, whether lowercase or uppercase, as well as any numeric digit. This can be an advantage, because the . metacharacter will match almost anything, which can be useful if you aren’t too concerned about exactly what you match or how many matches you end up with. The disadvantage of the . metacharacter is the same — it will match almost anything. For example, in a search-and-replace operation, replacing the sequence of characters that match the . metacharacter can be very dangerous, with results similar to, but potentially wider in scope than, the replacement of startling by Moontling that you saw in the Star Training Company example in Chapter 1.
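To see why replacing matches of the . metacharacter can be dangerous, consider this small Python sketch. The sample string and the replacement character are assumptions chosen for illustration, and Python's re.sub() stands in for whatever search-and-replace tool you happen to use.

```python
import re

text = "Star Training Company"

# Replacing the literal sequence Star changes only those four characters.
literal_result = re.sub(r"Star", "Moon", text)
print(literal_result)

# Replacing every match of . changes every character, spaces included.
period_result = re.sub(r".", "*", text)
print(period_result)
```

The second replacement leaves nothing of the original text, which is rarely what you intend.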
Try It Out: The Period (.) Metacharacter
Using the Komodo Regular Expression Toolkit, you can experiment with the period by entering alphabetic and numeric characters as test text. Remember that the Komodo Regular Expression Toolkit reports only the first match in the test text.
1. Open the Komodo development environment.
2. Click the button for the Komodo Regular Expression Toolkit, and clear any regular expression and test string in the toolkit.
3. Enter a test string in the Enter a String to Match Against area. The test string is Andrew.
4. Enter a period in the Enter a Regular Expression area of the toolkit, and inspect the result, which is displayed immediately below the Enter a String to Match Against area.
The result in this case is Match succeeded: 0 groups. The concept of groups is discussed in Chapter 7.
The . metacharacter matches any alphabetic character used in English, any numeric digit, whitespace characters such as the space character, and a very large number of alphabetic characters used in languages other than English. Figure 4-1 shows the . metacharacter in the Komodo Regular Expression Toolkit matching an uppercase A.
Figure 4-1
How It Works
When the . metacharacter occurs in a regular expression pattern, the regular expression engine attempts to match it against any single character other than a newline. That includes any uppercase or lowercase English alphabetic character, any numeric digit, and a very large number of non–English-language characters.
The regular expression engine begins attempting to find a match at the position immediately before the initial A of Andrew. The first character of the test text, A, is tested as a possible match for the . metacharacter. It matches. So the initial A is outlined in pale green, indicating that it is the first match.
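The same first-match behavior can be sketched outside Komodo. The following Python example is an assumption-laden stand-in for the toolkit: re.search() returns only the first match, and the empty tuple from groups() appears comparable to the toolkit's report of 0 groups.

```python
import re

# Search the test string Andrew for the first match of the . metacharacter.
match = re.search(r".", "Andrew")

print(match.group())        # the matched character
print(match.span())         # where the match starts and ends
print(len(match.groups()))  # number of captured groups
```

The match begins at offset 0 and is one character long, corresponding to the initial A.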
The . metacharacter also matches alphabetic characters in languages other than English.
Try It Out: The . Metacharacter Matching Non-English Characters
If you have closed the Komodo Regular Expression Toolkit, work through all of the steps below. If you have kept the toolkit open, start at Step 2.
1. Open the Komodo development environment, and click the button for the Komodo Regular Expression Toolkit.
2. Clear any regular expression and/or test string in the toolkit.
3. Open the Windows Character Map. In Windows XP, you can do that by selecting Start > All Programs > Accessories > System Tools and, finally, selecting Character Map.
4. Click once on the scroll bar to the right of the Character Map window. Click the uppercase Ω character (omega), and you should see something similar to that shown in Figure 4-2.
5. With the uppercase Ω selected, click the Select button. The Ω character should appear in the Character Map window’s Characters to Copy text box.
Figure 4-2
6. Click the Copy button in the Character Map window.
7. Enter a test string in the Enter a String to Match Against area of the Komodo Regular Expression Toolkit by clicking in the Enter a String to Match Against area and pressing Ctrl+V to paste. The test string is Ω.
8. Enter a period in the Enter a Regular Expression area of the toolkit, and inspect the result, which is displayed immediately below the Enter a String to Match Against area. Notice, too, that the uppercase omega is highlighted in pale green on-screen, indicating that it is a match for the . metacharacter.
How It Works
The regular expression engine attempts to match the . metacharacter against any character that is not a newline. An attempt at matching begins at the position immediately before the uppercase omega. The first character, the uppercase omega, matches the . metacharacter. Because the uppercase omega is a character that isn’t a newline, there is a match. Because the entire regular expression is matched (there is only a single metacharacter on this occasion), matching is complete and successful.
Referring back to Figure 4-2, you can see the . metacharacter matching the Greek uppercase letter omega.
You can also try the . metacharacter with any numeric digit or sequence of numeric digits — for example, 234 — and you will see that the . metacharacter matches any numeric digit from 0 through 9.
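The behavior described in this section can be checked with a brief Python sketch. Python's re module is an assumption here, standing in for the Komodo toolkit; the test strings are the ones used above, plus a lone newline to show the one character that the . metacharacter does not match by default.

```python
import re

dot = re.compile(r".")

# A non-English alphabetic character matches.
omega_match = dot.search("Ω")
print(omega_match.group())

# Each numeric digit in 234 matches in turn.
digit_matches = dot.findall("234")
print(digit_matches)

# A newline does not match in the default mode, so search() returns None.
newline_match = dot.search("\n")
print(newline_match)
```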
Using the . metacharacter with any English text is very straightforward. In most circumstances, it will match anything except a newline. However, the matching characteristics of the . metacharacter can be modified to match a newline. In the Komodo Regular Expression Toolkit, this can be done using the single-line mode.
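Other tools expose the same switch under different names. As a hedged sketch, Python's re module offers the re.DOTALL flag, which to the best of my knowledge corresponds to what is called single-line mode elsewhere; the two-line sample string is an assumption for illustration.

```python
import re

text = "line one\nline two"

# Default mode: the newline between the two lines is skipped.
default_matches = re.findall(r".", text)

# With re.DOTALL, the . metacharacter also matches the newline.
dotall_matches = re.findall(r".", text, flags=re.DOTALL)

print(len(default_matches))
print(len(dotall_matches))
print("\n" in dotall_matches)
```

With the flag set, the count of matches rises by one, and that extra match is the newline itself.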