- •Introduction
- •Who This Book Is For
- •What This Book Covers
- •How This Book Is Structured
- •What You Need to Use This Book
- •Conventions
- •Source Code
- •Errata
- •p2p.wrox.com
- •What Are Regular Expressions?
- •What Can Regular Expressions Be Used For?
- •Finding Doubled Words
- •Checking Input from Web Forms
- •Changing Date Formats
- •Finding Incorrect Case
- •Adding Links to URLs
- •Regular Expressions You Already Use
- •Search and Replace in Word Processors
- •Directory Listings
- •Online Searching
- •Why Regular Expressions Seem Intimidating
- •Compact, Cryptic Syntax
- •Whitespace Can Significantly Alter the Meaning
- •No Standards Body
- •Differences between Implementations
- •Characters Change Meaning in Different Contexts
- •Regular Expressions Can Be Case Sensitive
- •Case-Sensitive and Case-Insensitive Matching
- •Case and Metacharacters
- •Continual Evolution in Techniques Supported
- •Multiple Solutions for a Single Problem
- •What You Want to Do with a Regular Expression
- •Replacing Text in Quantity
- •Regular Expression Tools
- •findstr
- •Microsoft Word
- •StarOffice Writer/OpenOffice.org Writer
- •Komodo Rx Package
- •PowerGrep
- •Microsoft Excel
- •JavaScript and JScript
- •VBScript
- •Visual Basic.NET
- •Java
- •Perl
- •MySQL
- •SQL Server 2000
- •W3C XML Schema
- •An Analytical Approach to Using Regular Expressions
- •Express and Document What You Want to Do in English
- •Consider the Regular Expression Options Available
- •Consider Sensitivity and Specificity
- •Create Appropriate Regular Expressions
- •Document All but Simple Regular Expressions
- •Document What You Expect the Regular Expression to Do
- •Document What You Want to Match
- •Test the Results of a Regular Expression
- •Matching Single Characters
- •Matching Sequences of Characters That Each Occur Once
- •Introducing Metacharacters
- •Matching Sequences of Different Characters
- •Matching Optional Characters
- •Matching Multiple Optional Characters
- •Other Cardinality Operators
- •The * Quantifier
- •The + Quantifier
- •The Curly-Brace Syntax
- •The {n} Syntax
- •The {n,m} Syntax
- •Exercises
- •Regular Expression Metacharacters
- •Thinking about Characters and Positions
- •The Period (.) Metacharacter
- •Matching Variably Structured Part Numbers
- •Matching a Literal Period
- •The \w Metacharacter
- •The \W Metacharacter
- •Digits and Nondigits
- •The \d Metacharacter
- •Canadian Postal Code Example
- •The \D Metacharacter
- •Alternatives to \d and \D
- •The \s Metacharacter
- •Handling Optional Whitespace
- •The \S Metacharacter
- •The \t Metacharacter
- •The \n Metacharacter
- •Escaped Characters
- •Finding the Backslash
- •Modifiers
- •Global Search
- •Case-Insensitive Search
- •Exercises
- •Introduction to Character Classes
- •Choice between Two Characters
- •Using Quantifiers with Character Classes
- •Using the \b Metacharacter in Character Classes
- •Selecting Literal Square Brackets
- •Using Ranges in Character Classes
- •Alphabetic Ranges
- •Use [A-z] With Care
- •Digit Ranges in Character Classes
- •Hexadecimal Numbers
- •IP Addresses
- •Reverse Ranges in Character Classes
- •A Potential Range Trap
- •Finding HTML Heading Elements
- •Metacharacter Meaning within Character Classes
- •The ^ metacharacter
- •How to Use the - Metacharacter
- •Negated Character Classes
- •Combining Positive and Negative Character Classes
- •POSIX Character Classes
- •The [:alnum:] Character Class
- •Exercises
- •String, Line, and Word Boundaries
- •The ^ Metacharacter
- •The ^ Metacharacter and Multiline Mode
- •The $ Metacharacter
- •The $ Metacharacter in Multiline Mode
- •Using the ^ and $ Metacharacters Together
- •Matching Blank Lines
- •Working with Dollar Amounts
- •Revisiting the IP Address Example
- •What Is a Word?
- •Identifying Word Boundaries
- •The \< Syntax
- •The \>Syntax
- •The \b Syntax
- •The \B Metacharacter
- •Less-Common Word-Boundary Metacharacters
- •Exercises
- •Grouping Using Parentheses
- •Parentheses and Quantifiers
- •Matching Literal Parentheses
- •U.S. Telephone Number Example
- •Alternation
- •Choosing among Multiple Options
- •Unexpected Alternation Behavior
- •Capturing Parentheses
- •Numbering of Captured Groups
- •Numbering When Using Nested Parentheses
- •Named Groups
- •Non-Capturing Parentheses
- •Back References
- •Exercises
- •Why You Need Lookahead and Lookbehind
- •The (? metacharacters
- •Lookahead
- •Positive Lookahead
- •Negative Lookahead
- •Positive Lookahead Examples
- •Positive Lookahead in the Same Document
- •Inserting an Apostrophe
- •Lookbehind
- •Positive Lookbehind
- •Negative Lookbehind
- •How to Match Positions
- •Adding Commas to Large Numbers
- •Exercises
- •What Are Sensitivity and Specificity?
- •Extreme Sensitivity, Awful Specificity
- •Email Addresses Example
- •Replacing Hyphens Example
- •The Sensitivity/Specificity Trade-Off
- •Sensitivity, Specificity, and Positional Characters
- •Sensitivity, Specificity, and Modes
- •Sensitivity, Specificity, and Lookahead and Lookbehind
- •How Much Should the Regular Expressions Do?
- •Abbreviations
- •Characters from Other Languages
- •Names
- •Sensitivity and How to Achieve It
- •Specificity and How to Maximize It
- •Exercises
- •Documenting Regular Expressions
- •Document the Problem Definition
- •Add Comments to Your Code
- •Making Use of Extended Mode
- •Know Your Data
- •Abbreviations
- •Proper Names
- •Incorrect Spelling
- •Creating Test Cases
- •Debugging Regular Expressions
- •Treacherous Whitespace
- •Backslashes Causing Problems
- •Considering Other Causes
- •The User Interface
- •Metacharacters Available
- •Quantifiers
- •The @ Quantifier
- •The {n,m} Syntax
- •Modes
- •Character Classes
- •Back References
- •Lookahead and Lookbehind
- •Lazy Matching versus Greedy Matching
- •Examples
- •Character Class Examples, Including Ranges
- •Whole Word Searches
- •Search-and-Replace Examples
- •Changing Name Structure Using Back References
- •Manipulating Dates
- •The Star Training Company Example
- •Regular Expressions in Visual Basic for Applications
- •Exercises
- •The User Interface
- •Metacharacters Available
- •Quantifiers
- •Modes
- •Character Classes
- •Alternation
- •Back References
- •Lookahead and Lookbehind
- •Search Example
- •Search-and-Replace Example
- •Online Chats
- •POSIX Character Classes
- •Matching Numeric Digits
- •Exercises
- •Introducing findstr
- •Finding Literal Text
- •Quantifiers
- •Character Classes
- •Command-Line Switch Examples
- •The /v Switch
- •The /a Switch
- •Single File Examples
- •Simple Character Class Example
- •Find Protocols Example
- •Multiple File Example
- •A Filelist Example
- •Exercises
- •The PowerGREP Interface
- •A Simple Find Example
- •The Replace Tab
- •The File Finder Tab
- •Syntax Coloring
- •Other Tabs
- •Numeric Digits and Alphabetic Characters
- •Quantifiers
- •Back References
- •Alternation
- •Line Position Metacharacters
- •Word-Boundary Metacharacters
- •Lookahead and Lookbehind
- •Longer Examples
- •Finding HTML Horizontal Rule Elements
- •Matching Time Example
- •Exercises
- •The Excel Find Interface
- •Escaping Wildcard Characters
- •Using Wildcards in Data Forms
- •Using Wildcards in Filters
- •Exercises
- •Using LIKE with Regular Expressions
- •The % Metacharacter
- •The _ Metacharacter
- •Character Classes
- •Negated Character Classes
- •Using Full-Text Search
- •Using The CONTAINS Predicate
- •Document Filters on Image Columns
- •Exercises
- •Using the _ and % Metacharacters
- •Testing Matching of Literals: _ and % Metacharacters
- •Using Positional Metacharacters
- •Using Character Classes
- •Quantifiers
- •Social Security Number Example
- •Exercises
- •The Interface to Metacharacters in Microsoft Access
- •Creating a Hard-Wired Query
- •Creating a Parameter Query
- •Using the ? Metacharacter
- •Using the * Metacharacter
- •Using the # Metacharacter
- •Using the # Character with Date/Time Data
- •Using Character Classes in Access
- •Exercises
- •The RegExp Object
- •Attributes of the RegExp Object
- •The Other Properties of the RegExp Object
- •The test() Method of the RegExp Object
- •The exec() Method of the RegExp Object
- •The String Object
- •Metacharacters in JavaScript and JScript
- •SSN Validation Example
- •Exercises
- •The RegExp Object and How to Use It
- •Quantifiers
- •Positional Metacharacters
- •Character Classes
- •Word Boundaries
- •Lookahead
- •Grouping and Nongrouping Parentheses
- •Exercises
- •The System.Text.RegularExpressions namespace
- •A Simple Visual Basic .NET Example
- •The Classes of System.Text.RegularExpressions
- •The Regex Object
- •Using the Match Object and Matches Collection
- •Using the Match.Success Property and Match.NextMatch Method
- •The GroupCollection and Group Classes
- •The CaptureCollection and Capture Class
- •The RegexOptions Enumeration
- •Case-Insensitive Matching: The IgnoreCase Option
- •Multiline Matching: The Effect on the ^ and $ Metacharacters
- •Right to Left Matching: The RightToLeft Option
- •Lookahead and Lookbehind
- •Exercises
- •An Introductory Example
- •The Classes of System.Text.RegularExpressions
- •The Regex Class
- •The Options Property of the Regex Class
- •Regex Class Methods
- •The CompileToAssembly() Method
- •The GetGroupNames() Method
- •The GetGroupNumbers() Method
- •GroupNumberFromName() and GroupNameFromNumber() Methods
- •The IsMatch() Method
- •The Match() Method
- •The Matches() Method
- •The Replace() Method
- •The Split() Method
- •Using the Static Methods of the Regex Class
- •The IsMatch() Method as a Static
- •The Match() Method as a Static
- •The Matches() Method as a Static
- •The Replace() Method as a Static
- •The Split() Method as a Static
- •The Match and Matches Classes
- •The Match Class
- •The GroupCollection and Group Classes
- •The RegexOptions Class
- •The IgnorePatternWhitespace Option
- •Metacharacters Supported in Visual C# .NET
- •Using Named Groups
- •Using Back References
- •Exercise
- •The ereg() Set of Functions
- •The ereg() Function
- •The ereg() Function with Three Arguments
- •The eregi() Function
- •The ereg_replace() Function
- •The eregi_replace() Function
- •The split() Function
- •The spliti() Function
- •The sql_regcase() Function
- •Perl Compatible Regular Expressions
- •Pattern Delimiters in PCRE
- •Escaping Pattern Delimiters
- •Matching Modifiers in PCRE
- •Using the preg_match() Function
- •Using the preg_match_all() Function
- •Using the preg_grep() Function
- •Using the preg_quote() Function
- •Using the preg_replace() Function
- •Using the preg_replace_callback() Function
- •Using the preg_split() Function
- •Supported Metacharacters with ereg()
- •Using POSIX Character Classes with PHP
- •Supported Metacharacters with PCRE
- •Positional Metacharacters
- •Character Classes in PHP
- •Documenting PHP Regular Expressions
- •Exercises
- •W3C XML Schema Basics
- •Tools for Using W3C XML Schema
- •Comparing XML Schema and DTDs
- •How Constraints Are Expressed in W3C XML Schema
- •W3C XML Schema Datatypes
- •Derivation by Restriction
- •Unicode and W3C XML Schema
- •Unicode Overview
- •Using Unicode Character Classes
- •Matching Decimal Numbers
- •Mixing Unicode Character Classes with Other Metacharacters
- •Unicode Character Blocks
- •Using Unicode Character Blocks
- •Metacharacters Supported in W3C XML Schema
- •Positional Metacharacters
- •Matching Numeric Digits
- •Alternation
- •Using the \w and \s Metacharacters
- •Escaping Metacharacters
- •Exercises
- •Introduction to the java.util.regex Package
- •Obtaining and Installing Java
- •The Pattern Class
- •Using the matches() Method Statically
- •Two Simple Java Examples
- •The Properties (Fields) of the Pattern Class
- •The CASE_INSENSITIVE Flag
- •Using the COMMENTS Flag
- •The DOTALL Flag
- •The MULTILINE Flag
- •The UNICODE_CASE Flag
- •The UNIX_LINES Flag
- •The Methods of the Pattern Class
- •The compile() Method
- •The flags() Method
- •The matcher() Method
- •The matches() Method
- •The pattern() Method
- •The split() Method
- •The Matcher Class
- •The appendReplacement() Method
- •The appendTail() Method
- •The end() Method
- •The find() Method
- •The group() Method
- •The groupCount() Method
- •The lookingAt() Method
- •The matches() Method
- •The pattern() Method
- •The replaceAll() Method
- •The replaceFirst() Method
- •The reset() Method
- •The start() Method
- •The PatternSyntaxException Class
- •Using the \d Metacharacter
- •Character Classes
- •The POSIX Character Classes in the java.util.regex Package
- •Unicode Character Classes and Character Blocks
- •Using Escaped Characters
- •Using Methods of the String Class
- •Using the matches() Method
- •Using the replaceFirst() Method
- •Using the replaceAll() Method
- •Using the split() Method
- •Exercises
- •Obtaining and Installing Perl
- •Creating a Simple Perl Program
- •Basics of Perl Regular Expression Usage
- •Using the m// Operator
- •Using Other Regular Expression Delimiters
- •Matching Using Variable Substitution
- •Using the s/// Operator
- •Using s/// with the Global Modifier
- •Using s/// with the Default Variable
- •Using the split Operator
- •Using Quantifiers in Perl
- •Using Positional Metacharacters
- •Captured Groups in Perl
- •Using Back References in Perl
- •Using Alternation
- •Using Character Classes in Perl
- •Using Lookahead
- •Using Lookbehind
- •Escaping Metacharacters
- •A Simple Perl Regex Tester
- •Exercises
- •Index
Chapter 13
Quantifiers
Support for quantifiers in findstr is limited. The * quantifier is supported with its standard meaning of zero or more occurrences. However, the ? and + quantifiers are not supported, nor is the {n,m} quantifier notation.
The test files Order1.txt and Order2.txt show how the * quantifier can be used.
The content of Order1.txt is shown here:
This is an order for Part No. ABC123.
Blah blah. As easy as ABC.
2004/08/20
The content of Order2.txt is here:
This is an order for Part No. ABC456.
Blah blah.
2003/07/18
For the purposes of this example, the part number is the focus of interest. In many regular expression implementations you would use ABC\d{3} or ABC[0-9]{3} to match exactly three digits, but findstr does not support that syntax.
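Because findstr lacks the {n} notation, an exact repeat has to be spelled out by hand. The following Python sketch shows one way to build such a pattern string. The helper name expand_exact_repeat is hypothetical and is not part of findstr or of this book's code.

```python
def expand_exact_repeat(atom: str, count: int) -> str:
    """Spell out atom{count} by writing the atom count times.

    Hypothetical helper for illustration: it only builds a pattern string
    that a tool without {n} support, such as findstr, could accept.
    """
    if count < 0:
        raise ValueError("count must not be negative")
    return atom * count


# ABC followed by exactly three digits, written without the {3} notation.
findstr_pattern = "ABC" + expand_exact_repeat("[0-9]", 3)
print(findstr_pattern)
```

The resulting string is the pattern you would place between the double quotes on the findstr command line.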
Try It Out
The * Quantifier
1. Open a command window, and type the following command at the command prompt:
findstr /n "ABC[0-9]*" Order*.txt
2. Inspect the results returned, as shown in Figure 13-6. Notice that three lines contain a match. The second of the displayed lines is undesired because the occurrence of ABC with no following numeric digit is not a part number.
Figure 13-6
3. To match the desired number of numeric digits, exactly three, use the following pattern:
ABC[0-9][0-9][0-9]
Regular Expressions Using findstr
4. At the command line, enter the following command:
findstr /n "ABC[0-9][0-9][0-9]" Order*.txt
Figure 13-7 shows the results.
Figure 13-7
How It Works
In Step 2, the two lines that contain a part number (the character sequence ABC followed by three numeric digits) are matched, which is what you want. However, the second line in Order1.txt is also matched, because [0-9]* matches zero or more occurrences of a numeric digit. In As easy as ABC. the character sequence ABC is followed by zero numeric digits, which still satisfies the pattern ABC[0-9]*. The pattern in Step 4, ABC[0-9][0-9][0-9], requires exactly three numeric digits after ABC, so only the two lines containing part numbers are matched.
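The difference between the two patterns can be reproduced with a short Python sketch. It uses Python's re module as a stand-in, on the assumption that these two simple patterns mean the same thing in Python as in findstr; it does not reproduce findstr's output format. The function name matching_lines is hypothetical.

```python
import re

# Lines copied from Order1.txt and Order2.txt as shown in the text.
ORDER_LINES = [
    "This is an order for Part No. ABC123.",
    "Blah blah. As easy as ABC.",
    "2004/08/20",
    "This is an order for Part No. ABC456.",
    "Blah blah.",
    "2003/07/18",
]


def matching_lines(pattern, lines):
    """Return every line that contains at least one match for pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Zero or more digits after ABC: also accepts a bare ABC.
loose = matching_lines(r"ABC[0-9]*", ORDER_LINES)

# Exactly three digits spelled out: only the part-number lines.
strict = matching_lines(r"ABC[0-9][0-9][0-9]", ORDER_LINES)

print(len(loose), len(strict))
```

The loose pattern returns three lines, including the unwanted As easy as ABC. line, while the strict pattern returns only the two part-number lines.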
Back references, lookahead, and lookbehind are not supported in the findstr utility.
Character Classes
As you saw in an earlier example in this chapter, the character class [0-9] is supported in findstr. In fact, you need [0-9], or the equivalent explicit character class [0123456789], because findstr does not support the \d metacharacter.
The following text, contained in the file PartNums.txt, is the test file:
ABC123
DEF890
GHI234
HKO838
RUV991
ILR246
UVW991
ADF274
DRX119
In findstr, ranges are supported, as are negated character classes.
Try It Out
Character Classes
1. Open a command window, and navigate to the directory containing the file PartNums.txt.
2. Type the following at the command line:
findstr /n "A[A-Z][A-Z][0-9][0-9][0-9]" PartNums.txt
3. Inspect the results, as shown in Figure 13-8. The lines containing part numbers that begin with uppercase A, have two uppercase letters following, and have three numeric digits are displayed.
Figure 13-8
Because findstr displays every line that contains a match, you could have used a simpler pattern, A[A-Z][A-Z][0-9], given the sample data. However, if there were part numbers such as ABC1 in the test text, the simpler pattern would also match lines containing those part numbers, which may not be what you want.
4. Type the following command at the command line:
findstr /n "A[A-Z][A-Z][0-9]" PartNums.txt
5. Inspect the results. (Notice that the same lines are matched.)
6. If you want to match part numbers that begin with an uppercase A but that do not have an uppercase B as the second character in the part number, you can use the negated character class [^B] to achieve that.
At the command line, type the following command:
findstr /n "A[^B][A-Z][0-9]" PartNums.txt
7. Inspect the results (notice that Line 1 no longer matches), as shown in Figure 13-9.
Figure 13-9
How It Works
In Step 2, the pattern A[A-Z][A-Z][0-9][0-9][0-9] is used. The A matches uppercase A literally. Because only Line 1 and Line 15 contain a part number beginning with A, only those lines are possible matches for the rest of the regular expression. The first character class [A-Z] matches any uppercase alphabetic character, matching B on Line 1 and D on Line 15. The second occurrence of the character class [A-Z] in the regular expression matches C on Line 1 and F on Line 15. The three character classes [0-9][0-9][0-9] match three successive numeric digits, 123 on Line 1 and 274 on Line 15. So there are matches on lines 1 and 15.
In Step 6, the pattern A[^B][A-Z][0-9] is used. The initial A matches on lines 1 and 15 as before. The character class [^B] matches any character except uppercase B. So there is no match for A[^B] on Line 1. However, on Line 15, D is a match for [^B], so matching continues on Line 15. The [A-Z] pattern matches the F on Line 15, and [0-9] matches the 2 on Line 15. So the only match is on Line 15.
There is a risk in having a character class such as [^B] in a pattern, because it is almost equivalent to the dot metacharacter: it matches any character other than B, not just letters. So if a malformed part number A$C123 were in the test file, it would match the pattern A[^B][A-Z][0-9][0-9][0-9]. If the intent is to match any uppercase character except B, a more specific character class is [AC-Z]. So the regular expression would be
A[AC-Z][A-Z][0-9][0-9][0-9]
with [AC-Z] having the same meaning as [ACDEFGHIJKLMNOPQRSTUVWXYZ].
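The character-class behavior described in this section can be checked with a short Python sketch. It assumes that Python's re module treats these simple character classes the same way findstr does. The entries ABC1 and A$C123 are the hypothetical part numbers mentioned in the text and are not in PartNums.txt. The function name hits is also hypothetical.

```python
import re

# Part numbers as listed for PartNums.txt.
PART_NUMBERS = [
    "ABC123", "DEF890", "GHI234", "HKO838", "RUV991",
    "ILR246", "UVW991", "ADF274", "DRX119",
]

# Hypothetical entries discussed in the text but absent from the file.
HYPOTHETICAL = ["ABC1", "A$C123"]


def hits(pattern, items):
    """Return the items that contain a match for pattern anywhere."""
    regex = re.compile(pattern)
    return [item for item in items if regex.search(item)]


full = hits(r"A[A-Z][A-Z][0-9][0-9][0-9]", PART_NUMBERS)
short = hits(r"A[A-Z][A-Z][0-9]", PART_NUMBERS)
negated = hits(r"A[^B][A-Z][0-9]", PART_NUMBERS)

# The shorter pattern also accepts the too-short part number ABC1.
short_extra = hits(r"A[A-Z][A-Z][0-9]", HYPOTHETICAL)

# [^B] accepts the malformed A$C123, but [AC-Z] rejects it.
negated_extra = hits(r"A[^B][A-Z][0-9][0-9][0-9]", HYPOTHETICAL)
specific_extra = hits(r"A[AC-Z][A-Z][0-9][0-9][0-9]", HYPOTHETICAL)

print(full, short, negated)
print(short_extra, negated_extra, specific_extra)
```

On the listed data, the full and shortened patterns select the same two part numbers, and the negated class leaves only ADF274. The hypothetical entries show why the more specific class [AC-Z] is the safer choice.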
Word-Boundary Positions
The findstr utility supports separate metacharacters that match the beginning-of-word position and the end-of-word position. The \< metacharacter matches the beginning-of-word position, and the \> metacharacter matches the end-of-word position.
A test file, Word.txt, has the following content:
Swords are sharp, typically.
Words are powerful things. They can wound.
Churchill is a byword for wartime persistence.
Do you have a favorite word?
His surname is Answord.
Wordsworth was a famous English poet.
Word by word is, typically, not a good method of translation.
Notice that the character sequence word occurs at the beginning or end of a sequence of alphabetic characters or embedded inside a longer character sequence. Notice, too, that sometimes an uppercase character is part of word or Word, so you must take care in how you use case-sensitive or case-insensitive search.
Try It Out
Beginning- and End-of-Word Positions
1. Open a command window, and navigate to the directory containing the Word.txt test file.
2. At the command prompt, enter the following command:
findstr /n "Word" Word.txt
3. Inspect the results, as shown in Figure 13-10. Notice that only three of the seven lines containing text are displayed. This is because the default behavior of findstr is case-sensitive matching.
Figure 13-10
4. To ensure that all occurrences of the character sequence word are displayed, you can use the /i command-line switch.
At the command line, enter the following command:
findstr /n /i "Word" Word.txt
5. Inspect the results. Now all seven lines containing text are displayed, so you can be confident that all occurrences of the character sequence word are shown.
6. Next, look at the effect of the beginning-of-word position metacharacter, \<.
At the command prompt, enter the following command:
findstr /n /i "\<Word" Word.txt
7. Inspect the results, as shown in Figure 13-11. As you can see, only four of the seven lines that contain text are displayed. Each of the lines contains the character sequence word or Word (remember the matching is case insensitive) with that character sequence at the beginning of an alphabetic character sequence (in effect, at the beginning of what you would typically call a “word”).
Figure 13-11
8. You can add the end-of-word position metacharacter, \>, to the regular expression to make the matching more specific, matching only when the character sequence word or Word is preceded by a beginning-of-word position and followed by an end-of-word position.
At the command prompt, enter the following command:
findstr /n /i "\<Word\>" Word.txt
9. Inspect the results, as shown in Figure 13-12. Now only two lines are displayed. On each line the character sequence word is exactly that, a word. Strictly speaking, the beginning-of-word position and end-of-word position metacharacters mark the beginning and end of an alphabetic sequence, respectively. For many practical purposes, they signify the beginning and end of a word.
Figure 13-12
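The line counts reported in the preceding steps can be reproduced with a short Python sketch. Python's re module does not support the \< and \> metacharacters, so this sketch substitutes \b, which is an assumption: \b matches at either edge of a word and treats digits and underscores as word characters, but on this sample text it selects the same lines. The function name count_lines is hypothetical.

```python
import re

# Lines copied from Word.txt as shown in the text.
WORD_LINES = [
    "Swords are sharp, typically.",
    "Words are powerful things. They can wound.",
    "Churchill is a byword for wartime persistence.",
    "Do you have a favorite word?",
    "His surname is Answord.",
    "Wordsworth was a famous English poet.",
    "Word by word is, typically, not a good method of translation.",
]


def count_lines(pattern, lines, ignore_case=False):
    """Count the lines that contain at least one match for pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return sum(1 for line in lines if regex.search(line))


case_sensitive = count_lines(r"Word", WORD_LINES)        # like "Word"
any_case = count_lines(r"Word", WORD_LINES, True)        # like /i "Word"
word_start = count_lines(r"\bWord", WORD_LINES, True)    # stands in for \<Word
whole_word = count_lines(r"\bWord\b", WORD_LINES, True)  # stands in for \<Word\>

print(case_sensitive, any_case, word_start, whole_word)
```

Each added constraint narrows the result: three lines, then all seven with case-insensitive matching, then four and two as the word-boundary positions are added.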
Beginning- and End-of-Line Positions
The findstr utility offers two quite distinct ways to specify that matching is to take place at the beginning or end of a line. First, there are the /b and /e switches, which specify matching at the beginning and end of a line, respectively. Second, there are the ^ and $ metacharacters.
The content of the test file, Low.txt, is shown here:
Low is the opposite of high.
A Ferrari isn’t usually thought of as slow.
Slow, slow, quick, quick, slow
Slow, slow, quick, quick, slow.
Allow me to to pass please.
Lowering sky over a blackened sea.
Try It Out
Beginning- and End-of-Line Positions
1. Open a command window, and navigate to the directory containing the file Low.txt.
2. At the command prompt, enter the following command:
findstr /n /i "Low" Low.txt
3. Inspect the results. All six lines that contain text are displayed because the character sequence low, matched case insensitively (notice the /i switch), is present on all six lines.
4. Next, test the /b switch, which limits matching to the beginning of a line.
At the command line, enter the following command:
findstr /n /i /b "Low" Low.txt
5. Inspect the results, as shown in Figure 13-13. Now only two lines are displayed, each of which has the character sequence Low as its first three characters.
Figure 13-13
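The effect of anchoring a match to the beginning or end of a line can be sketched in Python. The sketch uses the ^ and $ metacharacters of Python's re module as a stand-in for the findstr /b and /e switches, which is an assumption about equivalent behavior on this sample text and not a reproduction of findstr itself. The function name shown is hypothetical.

```python
import re

# Lines copied from Low.txt as shown in the text.
LOW_LINES = [
    "Low is the opposite of high.",
    "A Ferrari isn’t usually thought of as slow.",
    "Slow, slow, quick, quick, slow",
    "Slow, slow, quick, quick, slow.",
    "Allow me to to pass please.",
    "Lowering sky over a blackened sea.",
]


def shown(pattern, lines):
    """Return the lines containing a case-insensitive match for pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [line for line in lines if regex.search(line)]


anywhere = shown(r"low", LOW_LINES)   # comparable to /i "Low"
at_start = shown(r"^low", LOW_LINES)  # comparable to /i /b "Low"
at_end = shown(r"low$", LOW_LINES)    # comparable to /i /e "Low"

print(len(anywhere), at_start, at_end)
```

In this sketch, only the third line ends with low, because the fourth line has a period after the final slow.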