- •Introduction
- •Who This Book Is For
- •What This Book Covers
- •How This Book Is Structured
- •What You Need to Use This Book
- •Conventions
- •Source Code
- •Errata
- •p2p.wrox.com
- •What Are Regular Expressions?
- •What Can Regular Expressions Be Used For?
- •Finding Doubled Words
- •Checking Input from Web Forms
- •Changing Date Formats
- •Finding Incorrect Case
- •Adding Links to URLs
- •Regular Expressions You Already Use
- •Search and Replace in Word Processors
- •Directory Listings
- •Online Searching
- •Why Regular Expressions Seem Intimidating
- •Compact, Cryptic Syntax
- •Whitespace Can Significantly Alter the Meaning
- •No Standards Body
- •Differences between Implementations
- •Characters Change Meaning in Different Contexts
- •Regular Expressions Can Be Case Sensitive
- •Case-Sensitive and Case-Insensitive Matching
- •Case and Metacharacters
- •Continual Evolution in Techniques Supported
- •Multiple Solutions for a Single Problem
- •What You Want to Do with a Regular Expression
- •Replacing Text in Quantity
- •Regular Expression Tools
- •findstr
- •Microsoft Word
- •StarOffice Writer/OpenOffice.org Writer
- •Komodo Rx Package
- •PowerGrep
- •Microsoft Excel
- •JavaScript and JScript
- •VBScript
- •Visual Basic.NET
- •Java
- •Perl
- •MySQL
- •SQL Server 2000
- •W3C XML Schema
- •An Analytical Approach to Using Regular Expressions
- •Express and Document What You Want to Do in English
- •Consider the Regular Expression Options Available
- •Consider Sensitivity and Specificity
- •Create Appropriate Regular Expressions
- •Document All but Simple Regular Expressions
- •Document What You Expect the Regular Expression to Do
- •Document What You Want to Match
- •Test the Results of a Regular Expression
- •Matching Single Characters
- •Matching Sequences of Characters That Each Occur Once
- •Introducing Metacharacters
- •Matching Sequences of Different Characters
- •Matching Optional Characters
- •Matching Multiple Optional Characters
- •Other Cardinality Operators
- •The * Quantifier
- •The + Quantifier
- •The Curly-Brace Syntax
- •The {n} Syntax
- •The {n,m} Syntax
- •Exercises
- •Regular Expression Metacharacters
- •Thinking about Characters and Positions
- •The Period (.) Metacharacter
- •Matching Variably Structured Part Numbers
- •Matching a Literal Period
- •The \w Metacharacter
- •The \W Metacharacter
- •Digits and Nondigits
- •The \d Metacharacter
- •Canadian Postal Code Example
- •The \D Metacharacter
- •Alternatives to \d and \D
- •The \s Metacharacter
- •Handling Optional Whitespace
- •The \S Metacharacter
- •The \t Metacharacter
- •The \n Metacharacter
- •Escaped Characters
- •Finding the Backslash
- •Modifiers
- •Global Search
- •Case-Insensitive Search
- •Exercises
- •Introduction to Character Classes
- •Choice between Two Characters
- •Using Quantifiers with Character Classes
- •Using the \b Metacharacter in Character Classes
- •Selecting Literal Square Brackets
- •Using Ranges in Character Classes
- •Alphabetic Ranges
- •Use [A-z] With Care
- •Digit Ranges in Character Classes
- •Hexadecimal Numbers
- •IP Addresses
- •Reverse Ranges in Character Classes
- •A Potential Range Trap
- •Finding HTML Heading Elements
- •Metacharacter Meaning within Character Classes
- •The ^ metacharacter
- •How to Use the - Metacharacter
- •Negated Character Classes
- •Combining Positive and Negative Character Classes
- •POSIX Character Classes
- •The [:alnum:] Character Class
- •Exercises
- •String, Line, and Word Boundaries
- •The ^ Metacharacter
- •The ^ Metacharacter and Multiline Mode
- •The $ Metacharacter
- •The $ Metacharacter in Multiline Mode
- •Using the ^ and $ Metacharacters Together
- •Matching Blank Lines
- •Working with Dollar Amounts
- •Revisiting the IP Address Example
- •What Is a Word?
- •Identifying Word Boundaries
- •The \< Syntax
- •The \> Syntax
- •The \b Syntax
- •The \B Metacharacter
- •Less-Common Word-Boundary Metacharacters
- •Exercises
- •Grouping Using Parentheses
- •Parentheses and Quantifiers
- •Matching Literal Parentheses
- •U.S. Telephone Number Example
- •Alternation
- •Choosing among Multiple Options
- •Unexpected Alternation Behavior
- •Capturing Parentheses
- •Numbering of Captured Groups
- •Numbering When Using Nested Parentheses
- •Named Groups
- •Non-Capturing Parentheses
- •Back References
- •Exercises
- •Why You Need Lookahead and Lookbehind
- •The (? metacharacters
- •Lookahead
- •Positive Lookahead
- •Negative Lookahead
- •Positive Lookahead Examples
- •Positive Lookahead in the Same Document
- •Inserting an Apostrophe
- •Lookbehind
- •Positive Lookbehind
- •Negative Lookbehind
- •How to Match Positions
- •Adding Commas to Large Numbers
- •Exercises
- •What Are Sensitivity and Specificity?
- •Extreme Sensitivity, Awful Specificity
- •Email Addresses Example
- •Replacing Hyphens Example
- •The Sensitivity/Specificity Trade-Off
- •Sensitivity, Specificity, and Positional Characters
- •Sensitivity, Specificity, and Modes
- •Sensitivity, Specificity, and Lookahead and Lookbehind
- •How Much Should the Regular Expressions Do?
- •Abbreviations
- •Characters from Other Languages
- •Names
- •Sensitivity and How to Achieve It
- •Specificity and How to Maximize It
- •Exercises
- •Documenting Regular Expressions
- •Document the Problem Definition
- •Add Comments to Your Code
- •Making Use of Extended Mode
- •Know Your Data
- •Abbreviations
- •Proper Names
- •Incorrect Spelling
- •Creating Test Cases
- •Debugging Regular Expressions
- •Treacherous Whitespace
- •Backslashes Causing Problems
- •Considering Other Causes
- •The User Interface
- •Metacharacters Available
- •Quantifiers
- •The @ Quantifier
- •The {n,m} Syntax
- •Modes
- •Character Classes
- •Back References
- •Lookahead and Lookbehind
- •Lazy Matching versus Greedy Matching
- •Examples
- •Character Class Examples, Including Ranges
- •Whole Word Searches
- •Search-and-Replace Examples
- •Changing Name Structure Using Back References
- •Manipulating Dates
- •The Star Training Company Example
- •Regular Expressions in Visual Basic for Applications
- •Exercises
- •The User Interface
- •Metacharacters Available
- •Quantifiers
- •Modes
- •Character Classes
- •Alternation
- •Back References
- •Lookahead and Lookbehind
- •Search Example
- •Search-and-Replace Example
- •Online Chats
- •POSIX Character Classes
- •Matching Numeric Digits
- •Exercises
- •Introducing findstr
- •Finding Literal Text
- •Quantifiers
- •Character Classes
- •Command-Line Switch Examples
- •The /v Switch
- •The /a Switch
- •Single File Examples
- •Simple Character Class Example
- •Find Protocols Example
- •Multiple File Example
- •A Filelist Example
- •Exercises
- •The PowerGREP Interface
- •A Simple Find Example
- •The Replace Tab
- •The File Finder Tab
- •Syntax Coloring
- •Other Tabs
- •Numeric Digits and Alphabetic Characters
- •Quantifiers
- •Back References
- •Alternation
- •Line Position Metacharacters
- •Word-Boundary Metacharacters
- •Lookahead and Lookbehind
- •Longer Examples
- •Finding HTML Horizontal Rule Elements
- •Matching Time Example
- •Exercises
- •The Excel Find Interface
- •Escaping Wildcard Characters
- •Using Wildcards in Data Forms
- •Using Wildcards in Filters
- •Exercises
- •Using LIKE with Regular Expressions
- •The % Metacharacter
- •The _ Metacharacter
- •Character Classes
- •Negated Character Classes
- •Using Full-Text Search
- •Using The CONTAINS Predicate
- •Document Filters on Image Columns
- •Exercises
- •Using the _ and % Metacharacters
- •Testing Matching of Literals: _ and % Metacharacters
- •Using Positional Metacharacters
- •Using Character Classes
- •Quantifiers
- •Social Security Number Example
- •Exercises
- •The Interface to Metacharacters in Microsoft Access
- •Creating a Hard-Wired Query
- •Creating a Parameter Query
- •Using the ? Metacharacter
- •Using the * Metacharacter
- •Using the # Metacharacter
- •Using the # Character with Date/Time Data
- •Using Character Classes in Access
- •Exercises
- •The RegExp Object
- •Attributes of the RegExp Object
- •The Other Properties of the RegExp Object
- •The test() Method of the RegExp Object
- •The exec() Method of the RegExp Object
- •The String Object
- •Metacharacters in JavaScript and JScript
- •SSN Validation Example
- •Exercises
- •The RegExp Object and How to Use It
- •Quantifiers
- •Positional Metacharacters
- •Character Classes
- •Word Boundaries
- •Lookahead
- •Grouping and Nongrouping Parentheses
- •Exercises
- •The System.Text.RegularExpressions namespace
- •A Simple Visual Basic .NET Example
- •The Classes of System.Text.RegularExpressions
- •The Regex Object
- •Using the Match Object and Matches Collection
- •Using the Match.Success Property and Match.NextMatch Method
- •The GroupCollection and Group Classes
- •The CaptureCollection and Capture Class
- •The RegexOptions Enumeration
- •Case-Insensitive Matching: The IgnoreCase Option
- •Multiline Matching: The Effect on the ^ and $ Metacharacters
- •Right to Left Matching: The RightToLeft Option
- •Lookahead and Lookbehind
- •Exercises
- •An Introductory Example
- •The Classes of System.Text.RegularExpressions
- •The Regex Class
- •The Options Property of the Regex Class
- •Regex Class Methods
- •The CompileToAssembly() Method
- •The GetGroupNames() Method
- •The GetGroupNumbers() Method
- •GroupNumberFromName() and GroupNameFromNumber() Methods
- •The IsMatch() Method
- •The Match() Method
- •The Matches() Method
- •The Replace() Method
- •The Split() Method
- •Using the Static Methods of the Regex Class
- •The IsMatch() Method as a Static
- •The Match() Method as a Static
- •The Matches() Method as a Static
- •The Replace() Method as a Static
- •The Split() Method as a Static
- •The Match and Matches Classes
- •The Match Class
- •The GroupCollection and Group Classes
- •The RegexOptions Class
- •The IgnorePatternWhitespace Option
- •Metacharacters Supported in Visual C# .NET
- •Using Named Groups
- •Using Back References
- •Exercise
- •The ereg() Set of Functions
- •The ereg() Function
- •The ereg() Function with Three Arguments
- •The eregi() Function
- •The ereg_replace() Function
- •The eregi_replace() Function
- •The split() Function
- •The spliti() Function
- •The sql_regcase() Function
- •Perl Compatible Regular Expressions
- •Pattern Delimiters in PCRE
- •Escaping Pattern Delimiters
- •Matching Modifiers in PCRE
- •Using the preg_match() Function
- •Using the preg_match_all() Function
- •Using the preg_grep() Function
- •Using the preg_quote() Function
- •Using the preg_replace() Function
- •Using the preg_replace_callback() Function
- •Using the preg_split() Function
- •Supported Metacharacters with ereg()
- •Using POSIX Character Classes with PHP
- •Supported Metacharacters with PCRE
- •Positional Metacharacters
- •Character Classes in PHP
- •Documenting PHP Regular Expressions
- •Exercises
- •W3C XML Schema Basics
- •Tools for Using W3C XML Schema
- •Comparing XML Schema and DTDs
- •How Constraints Are Expressed in W3C XML Schema
- •W3C XML Schema Datatypes
- •Derivation by Restriction
- •Unicode and W3C XML Schema
- •Unicode Overview
- •Using Unicode Character Classes
- •Matching Decimal Numbers
- •Mixing Unicode Character Classes with Other Metacharacters
- •Unicode Character Blocks
- •Using Unicode Character Blocks
- •Metacharacters Supported in W3C XML Schema
- •Positional Metacharacters
- •Matching Numeric Digits
- •Alternation
- •Using the \w and \s Metacharacters
- •Escaping Metacharacters
- •Exercises
- •Introduction to the java.util.regex Package
- •Obtaining and Installing Java
- •The Pattern Class
- •Using the matches() Method Statically
- •Two Simple Java Examples
- •The Properties (Fields) of the Pattern Class
- •The CASE_INSENSITIVE Flag
- •Using the COMMENTS Flag
- •The DOTALL Flag
- •The MULTILINE Flag
- •The UNICODE_CASE Flag
- •The UNIX_LINES Flag
- •The Methods of the Pattern Class
- •The compile() Method
- •The flags() Method
- •The matcher() Method
- •The matches() Method
- •The pattern() Method
- •The split() Method
- •The Matcher Class
- •The appendReplacement() Method
- •The appendTail() Method
- •The end() Method
- •The find() Method
- •The group() Method
- •The groupCount() Method
- •The lookingAt() Method
- •The matches() Method
- •The pattern() Method
- •The replaceAll() Method
- •The replaceFirst() Method
- •The reset() Method
- •The start() Method
- •The PatternSyntaxException Class
- •Using the \d Metacharacter
- •Character Classes
- •The POSIX Character Classes in the java.util.regex Package
- •Unicode Character Classes and Character Blocks
- •Using Escaped Characters
- •Using Methods of the String Class
- •Using the matches() Method
- •Using the replaceFirst() Method
- •Using the replaceAll() Method
- •Using the split() Method
- •Exercises
- •Obtaining and Installing Perl
- •Creating a Simple Perl Program
- •Basics of Perl Regular Expression Usage
- •Using the m// Operator
- •Using Other Regular Expression Delimiters
- •Matching Using Variable Substitution
- •Using the s/// Operator
- •Using s/// with the Global Modifier
- •Using s/// with the Default Variable
- •Using the split Operator
- •Using Quantifiers in Perl
- •Using Positional Metacharacters
- •Captured Groups in Perl
- •Using Back References in Perl
- •Using Alternation
- •Using Character Classes in Perl
- •Using Lookahead
- •Using Lookbehind
- •Escaping Metacharacters
- •A Simple Perl Regex Tester
- •Exercises
- •Index
Regular Expression Tools and an Approach to Using Them
Suppose that you use the following regular expression to look for a word such as ball; it will always match.
.*
Because that regular expression, simply stated, says, “Match zero or more characters of any kind,” it will match any word or sequence of characters. It has 100 percent sensitivity in that it will always match any occurrence of the word ball but essentially 0 percent specificity because it will also match a host of undesired words.
Toward the other end of the spectrum, suppose that you want to match all occurrences of all forms of the word ball, including the plural form, balls, and the possessive form, ball’s. In that situation, using the following regular expression pattern will match only the singular form ball if it is in a context where only whole words are matched.
ball
In a context where partial words are matched, it may instead match the first four letters of balls and ball’s, which may not be what you want.
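The trade-off described above can be made concrete with a small sketch. It uses Python's re module, and the word list is a hypothetical sample invented for illustration, not data from this book. Whole-word matching is approximated with re.fullmatch() and partial matching with re.search().

```python
import re

# Hypothetical sample data: the first three words are the ones we want.
words = ["ball", "balls", "ball's", "balloon", "football", "bat"]
wanted = {"ball", "balls", "ball's"}

def whole_word_hits(pattern):
    """Words that the pattern matches in their entirety."""
    return {w for w in words if re.fullmatch(pattern, w)}

def partial_hits(pattern):
    """Words in which the pattern matches anywhere."""
    return {w for w in words if re.search(pattern, w)}

everything = whole_word_hits(r".*")         # every word: high sensitivity, no specificity
literal_whole = whole_word_hits(r"ball")    # only the singular form
literal_part = partial_hits(r"ball")        # wanted forms plus balloon and football
tuned = whole_word_hits(r"ball(?:s|'s)?")   # exactly the three wanted forms

print(sorted(literal_part - wanted))
```

The final pattern is only one possible refinement; the syntax it relies on (grouping, alternation, and the ? quantifier) is covered in later chapters.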
To be able to discuss this topic in detail, you will need to understand more about regular expression syntax so that you can try out various options and show the effect of choices that you might make in designing your regular expression patterns.
Create Appropriate Regular Expressions
Once you have given careful thought to precisely what it is you want to do and have studied the data source sufficiently to give you a good understanding of what it contains, you are in a good position to create regular expression patterns appropriate to your needs. There is no magic formula that is appropriate for all situations. Only you, the developer, can decide precisely what you want to match and want not to match. To get the desired results you may need to carry out text manipulation in two steps. However, often you will be able to carry out a match or replace in one step by combining regular expression constructs.
Document All but Simple Regular Expressions
If you are creating regular expression patterns that go beyond simple patterns, I suggest that you seriously consider documenting the regular expression. Why do that? Think about the possible situation in 6 or 12 months’ time when you come back to your code, perhaps because it isn’t behaving exactly as users expect, and you can’t decipher the precise purpose of the regular expression. It’s in these situations that the truism that regular expressions can be difficult to decipher becomes very real. The existence of clear and complete documentation can prevent a lot of wasted time and frustration.
Several of the language-specific chapters later in this book will show you how you can document regular expression code. Some languages (for example, Perl) allow you to specify a mode that enables you to include inline comments about your regular expressions. Documenting each component of a regular expression pattern in that way makes it much easier to follow the intentions of the original developer and either spot any flaws in the approach or analysis, or adapt it more easily to an altered business need.
Chapter 2
When you are using regular expressions interactively (such as in Microsoft Word or OpenOffice.org Writer), it makes little sense to document them. The regular expressions (wildcards) that you use interactively in a word processor are typically fairly simple, and a word processor doesn’t provide any standard way for you to document them.
When creating more complicated regular expressions, there are three aspects of the regular expression that I suggest you consider documenting:
What you expect the regular expression to do
What you want to select
What you want not to select
The more complex the problem you are seeking to define and the more complex the regular expression pattern(s) you create, the more likely it is that you will want to take time to document each of these aspects.
Each of these aspects is discussed in the following sections.
Document What You Expect the Regular Expression to Do
Documenting the intention of a regular expression is particularly useful when you are creating a regular expression that is intermediate or higher in complexity. Of course, your perception of what is advanced or complex will change as your experience and skills in writing and interpreting regular expressions increase. You may well find, like many other users of regular expressions, that your intuitive feel for a regular expression and what you intended it to do falls off severely after a few weeks or months. To minimize the effects of that falloff in understanding, it is better to err on the side of too much documentation rather than too little.
How might you document a regular expression? Let’s return to the Star Training Company example. Depending on how you iterate through refinement of the problem definition, you may also find that you need to refine the documentation comments you include in your code.
I will make the assumption that for this project, you are working in Visual Basic .NET. A first attempt at documentation might look like this:
'Replace Star with Moon
At first sight, this seems straightforward but, as you saw in Chapter 1, it is documenting an approach that can result in a messed-up document when it is applied without due attention to refining that initial thought.
Another attempt at defining what ought to be done might be the following:
'Replace Star with Moon when it occurs as a whole word but leave Star
'unchanged when it occurs as part of a word.
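To show how a comment like the one above might translate into a pattern, here is a minimal sketch. It is written in Python rather than the Visual Basic .NET assumed in the text, and the function name and sample sentence are hypothetical. It relies on the \b word-boundary metacharacter, which is introduced later in this book.

```python
import re

def replace_whole_word(text, old="Star", new="Moon"):
    """Replace old with new only where old stands as a whole word."""
    # \b anchors the match at word boundaries, so "Startling" is left alone.
    return re.sub(r"\b" + re.escape(old) + r"\b", new, text)

sample = "Star Training Company gave a Startling demo. Star is the best."
result = replace_whole_word(sample)
print(result)
```

Note that an apostrophe also counts as a word boundary, so this sketch would change Star's to Moon's as well. Whether that is desirable is exactly the kind of decision the documentation comments should record.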
If you iterate through the problem definition several times and choose to create the documentation comments early in that iteration process, be sure to update the documentation comments when you make any changes to the regular expression pattern. If you forget to update the documentation comments, you can end up with documentation comments describing something that you no longer intend to do. And that, in my experience, is one of the few situations where having documentation can be worse than having no documentation at all.
In more formal or more complex situations, you may also want to create paper documentation, as part of the documentation of a project, that describes in detail what the regular expressions were intended to do.
Document What You Want to Match
Express as precisely as you can what patterns of characters you want to match. The more formally you make it your habit to express this notion, the more likely you are to fully understand what it is that you really want to do.
Because the effect of a regular expression is to match some sequence of characters, it can be helpful to spell out precisely what sequence(s) of characters it is that you intend to match.
So continuing with the fictional Star Training Company example, you might add these comments to the code:
'Match Star each time it occurs as part of Star Training Company
'Match Star when it is standalone but refers to Star Training Company
'for example in phrases like "Star is the best"
'Match any occurrence of the possessive form Star's
After you document clearly what your aim is, you are in a better position to create a regular expression pattern that does exactly what you intended it to do.
Document What You Don’t Want to Select
This may seem the oddest part of the suggested process because, by definition, the text that you don’t want to match is probably, at first sight anyway, of least interest to you. However, mistakes that cause you to match and change undesired text can give you “moontling” results, as you saw with the Star Training Company example, where the word startling was inappropriately replaced by the sequence of characters Moontling due to a search and replace that wasn’t sufficiently specific.
If the new recruit for Star Training Company had taken time to document words that he didn’t want to change (such as the following), the result of the search and replace might have been less obviously bad:
'Don’t match any occurrence of words like start, startling
The better you understand your data source and the effects of regular expression patterns that you are considering creating, the more specific you can be in your comments.
Use Whitespace to Aid in Clear Documentation of the Regular Expression
In several languages, such as Perl, you can spread a regular expression over several lines. That allows you to use whitespace intelligently together with comments for each logical component of the regular expression, so that you achieve a much clearer set of comments. Because each part of the regular expression has its own comments, ambiguity is reduced or avoided.
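As an illustration of the idea, the following sketch spreads a pattern over several lines with a comment on each component. It uses Python's re.VERBOSE flag, which plays a role similar to the Perl mode mentioned above. The pattern itself is a hypothetical one for the Star Training Company example, not a pattern taken from this book.

```python
import re

star_pattern = re.compile(
    r"""
    \b        # start at a word boundary, so "MegaStar" is skipped
    Star      # the literal company name
    (?:'s)?   # optionally, the possessive ending
    \b        # end at a word boundary, so "Startling" is skipped
    """,
    re.VERBOSE,
)

found = star_pattern.findall("Star's courses are Startling. Star is the best.")
print(found)
```

In this mode, whitespace in the pattern is ignored and everything from # to the end of the line is treated as a comment, so each logical component can carry its own explanation.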
In some other languages (JavaScript is an example), you cannot use whitespace in this way because a JavaScript regular expression literal must be written on a single line and JavaScript has no mode that ignores whitespace and comments inside a pattern. When writing complex regular expressions in a language such as JavaScript, I suggest you consider writing a copy of the regular expression pattern, broken into components with an explanation of each, in comments on the lines immediately following the regular expression pattern itself.
In the short term, it adds to your tasks, but it can save time because you are forced to make explicit what you are trying to do. Down the road, it can help you or another developer modify the code with a fuller understanding of the original objectives.
By adding comments in that fashion, you gain many of the benefits of the detailed documentation when patterns are split over several lines. The key difference is that in languages such as Perl and Java you can add comments inside the regular expression on live components of a pattern, whereas in JavaScript you are adding comments to a text copy of components of the code, which is treated by the JavaScript interpreter as comments.
Of course, one risk of doing that is that the working copy of the regular expression is different from the componentized documentation copy.
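One possible safeguard against that risk, offered here as a suggestion rather than something this chapter prescribes, is to keep the documentation copy as data and check it against the working pattern. The sketch below is in Python for consistency with the other sketches; the names WORKING_PATTERN and COMPONENTS are hypothetical.

```python
import re

# Working copy of the pattern, written on one line as it would be in a
# language without an extended mode.
WORKING_PATTERN = r"\bStar(?:'s)?\b"

# Documentation copy: each component paired with its explanation.
COMPONENTS = [
    (r"\b", "start at a word boundary"),
    (r"Star", "the literal company name"),
    (r"(?:'s)?", "optionally, the possessive ending"),
    (r"\b", "end at a word boundary"),
]

def documentation_matches_pattern():
    """True when the documented components still add up to the working pattern."""
    return "".join(fragment for fragment, _ in COMPONENTS) == WORKING_PATTERN

print(documentation_matches_pattern())
```

If someone edits the working pattern without updating the components, the check returns False, which makes the drift visible instead of silent.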
Test the Results of a Regular Expression
When you are working on a single, fairly simple document, you probably don’t need to test a regular expression other than interactively. Most of the examples in this book, for practical reasons of space, use short, simplified documents. Depending on what you want to use a regular expression to do, you will probably find that you can often carry out regular expression matching interactively when using the examples from this book. That is no more than a simple form of testing. Does the pattern select what you want and avoid selecting undesired matches? Then use it there and then.
However, that interactive approach doesn’t scale. When you are using regular expressions on dozens, hundreds, or perhaps hundreds of thousands of documents or documents that may be many megabytes, you want to be sure that you don’t create a mess like the one shown in the Star Training Company example in Chapter 1, but on a much larger scale.
Therefore, it makes a lot of sense to carefully test a complex regular expression on some appropriate test data to make sure that you find all the character sequences that you want to find and that you don’t inadvertently change character sequences that you don’t want to be changed.
The short sample documents in this book may give you ideas about the kind of test documents you will need to create. But simply copying example documents from this book will very likely not work. You must carefully consider the data that you want to select and possibly also similar data that you want to be sure not to select. Only careful thought about the actual data that you are processing and what changes you want to make will allow you to craft a really useful test document.
If you follow the steps suggested here about how to handle nontrivial regular expression tasks, you should be in good shape to create test documents that are relevant and helpful. Because you should also have a good understanding of the desired text manipulation task, you should be able to take your problem definition and translate that into an appropriate and accurate regular expression pattern to achieve what you want. Testing that regular expression on carefully chosen test data will give you confidence that the large-scale manipulation of textual data will also succeed.
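The testing approach described above can be sketched as a pair of lists, one of text that should match and one of text that should not. This is a minimal illustration in Python. The pattern and the test strings are hypothetical and would need to be replaced by cases drawn from your own data.

```python
import re

PATTERN = re.compile(r"\bStar(?:'s)?\b")

# Test data chosen from the documented intentions for the pattern.
SHOULD_MATCH = ["Star Training Company", "Star is the best", "Star's courses"]
SHOULD_NOT_MATCH = ["start", "startling", "Startling results"]

def failures(pattern):
    """Return descriptions of the test cases that the pattern gets wrong."""
    problems = []
    for text in SHOULD_MATCH:
        if not pattern.search(text):
            problems.append("missed: " + text)
    for text in SHOULD_NOT_MATCH:
        if pattern.search(text):
            problems.append("unwanted match: " + text)
    return problems

print(failures(PATTERN))
print(failures(re.compile(r"Star")))
```

The second call shows what a less specific pattern looks like under test: the bare literal Star passes every positive case but is reported for matching inside Startling.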
When you are intending to manipulate large amounts of data, always make sure that you have good backups of the data. If you plan the text manipulation task well, everything ought to go smoothly, and you are unlikely to need to make use of the backups. But undoing an incorrectly designed regular expression task that has been carried out on many megabytes of data is not a desirable situation to get yourself into.
Having backups is one thing. Having backups that you know can be used to restore data is another. The only way you can be totally sure that your backups work is to test that they can be read and used to restore a configuration. That should be a routine quality assurance procedure for valuable data. It is too late to find out that your backups don’t work at the time when you really need them to rescue you from some disaster, whether caused by badly crafted regular expressions or some other cause.
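As a small-scale illustration of backing up before a replace, the following Python sketch copies a file, confirms that the copy is byte-for-byte identical, and only then rewrites the original. The function name, the .bak suffix, and the sample file are assumptions made for this example. The comparison is a much weaker check than the full restore test recommended above.

```python
import filecmp
import re
import shutil
import tempfile
from pathlib import Path

def replace_with_backup(path, pattern, replacement):
    """Copy the file aside, verify the copy, then rewrite the original."""
    source = Path(path)
    backup = source.with_name(source.name + ".bak")
    shutil.copy2(source, backup)
    # Compare contents before touching the original.
    if not filecmp.cmp(source, backup, shallow=False):
        raise OSError("backup does not match original: " + str(backup))
    text = source.read_text(encoding="utf-8")
    source.write_text(re.sub(pattern, replacement, text), encoding="utf-8")
    return backup

# Usage on a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as folder:
    doc = Path(folder) / "brochure.txt"
    doc.write_text("Star is Startling.", encoding="utf-8")
    backup = replace_with_backup(doc, r"\bStar\b", "Moon")
    changed = doc.read_text(encoding="utf-8")
    original = backup.read_text(encoding="utf-8")
```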
3
Simple Regular Expressions
This chapter takes a closer look at some basic aspects of constructing some simple regular expressions. One reason for working through the simple regular expressions examined in this chapter is to reinforce the approach described in Chapter 2 and look at how it can be applied to fairly simple regular expressions.
The examples used are necessarily simple, but by using regular expressions to match fairly simple text patterns, you should become increasingly familiar and comfortable with the use of foundational regular expression constructs that can be used to form part of more complex regular expressions. Later chapters explore additional regular expression constructs and address progressively more complex problems.
One of the issues this chapter explores in some detail is how to match characters that occur some number of times other than exactly once.
This chapter looks at the following:
How to match single characters
How to match optional characters
How to match characters that can occur an unbounded number of times, whether the characters of interest are optional or required
How to match characters that can occur a specified number of times
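As a rough preview of the constructs in the list above, the following Python sketch applies one pattern of each kind to a few made-up strings. The sample strings and variable names are hypothetical, and the chapter itself works through these constructs with its own examples and tools.

```python
import re

samples = ["bt", "bat", "baat", "baaat"]

def full_matches(pattern):
    """Return the samples that the pattern matches from start to end."""
    return [s for s in samples if re.fullmatch(pattern, s)]

single = full_matches(r"bat")          # each character occurs exactly once
optional = full_matches(r"ba?t")       # the a is optional: zero or one
zero_or_more = full_matches(r"ba*t")   # optional and unbounded
one_or_more = full_matches(r"ba+t")    # required and unbounded
exactly_two = full_matches(r"ba{2}t")  # a specified number of times

print(optional, one_or_more, exactly_two)
```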
First, let’s look at the simplest situation: matching single characters.