- •Introduction
- •Who This Book Is For
- •What This Book Covers
- •How This Book Is Structured
- •What You Need to Use This Book
- •Conventions
- •Source Code
- •Errata
- •p2p.wrox.com
- •What Are Regular Expressions?
- •What Can Regular Expressions Be Used For?
- •Finding Doubled Words
- •Checking Input from Web Forms
- •Changing Date Formats
- •Finding Incorrect Case
- •Adding Links to URLs
- •Regular Expressions You Already Use
- •Search and Replace in Word Processors
- •Directory Listings
- •Online Searching
- •Why Regular Expressions Seem Intimidating
- •Compact, Cryptic Syntax
- •Whitespace Can Significantly Alter the Meaning
- •No Standards Body
- •Differences between Implementations
- •Characters Change Meaning in Different Contexts
- •Regular Expressions Can Be Case Sensitive
- •Case-Sensitive and Case-Insensitive Matching
- •Case and Metacharacters
- •Continual Evolution in Techniques Supported
- •Multiple Solutions for a Single Problem
- •What You Want to Do with a Regular Expression
- •Replacing Text in Quantity
- •Regular Expression Tools
- •findstr
- •Microsoft Word
- •StarOffice Writer/OpenOffice.org Writer
- •Komodo Rx Package
- •PowerGrep
- •Microsoft Excel
- •JavaScript and JScript
- •VBScript
- •Visual Basic.NET
- •Java
- •Perl
- •MySQL
- •SQL Server 2000
- •W3C XML Schema
- •An Analytical Approach to Using Regular Expressions
- •Express and Document What You Want to Do in English
- •Consider the Regular Expression Options Available
- •Consider Sensitivity and Specificity
- •Create Appropriate Regular Expressions
- •Document All but Simple Regular Expressions
- •Document What You Expect the Regular Expression to Do
- •Document What You Want to Match
- •Test the Results of a Regular Expression
- •Matching Single Characters
- •Matching Sequences of Characters That Each Occur Once
- •Introducing Metacharacters
- •Matching Sequences of Different Characters
- •Matching Optional Characters
- •Matching Multiple Optional Characters
- •Other Cardinality Operators
- •The * Quantifier
- •The + Quantifier
- •The Curly-Brace Syntax
- •The {n} Syntax
- •The {n,m} Syntax
- •Exercises
- •Regular Expression Metacharacters
- •Thinking about Characters and Positions
- •The Period (.) Metacharacter
- •Matching Variably Structured Part Numbers
- •Matching a Literal Period
- •The \w Metacharacter
- •The \W Metacharacter
- •Digits and Nondigits
- •The \d Metacharacter
- •Canadian Postal Code Example
- •The \D Metacharacter
- •Alternatives to \d and \D
- •The \s Metacharacter
- •Handling Optional Whitespace
- •The \S Metacharacter
- •The \t Metacharacter
- •The \n Metacharacter
- •Escaped Characters
- •Finding the Backslash
- •Modifiers
- •Global Search
- •Case-Insensitive Search
- •Exercises
- •Introduction to Character Classes
- •Choice between Two Characters
- •Using Quantifiers with Character Classes
- •Using the \b Metacharacter in Character Classes
- •Selecting Literal Square Brackets
- •Using Ranges in Character Classes
- •Alphabetic Ranges
- •Use [A-z] With Care
- •Digit Ranges in Character Classes
- •Hexadecimal Numbers
- •IP Addresses
- •Reverse Ranges in Character Classes
- •A Potential Range Trap
- •Finding HTML Heading Elements
- •Metacharacter Meaning within Character Classes
- •The ^ metacharacter
- •How to Use the - Metacharacter
- •Negated Character Classes
- •Combining Positive and Negative Character Classes
- •POSIX Character Classes
- •The [:alnum:] Character Class
- •Exercises
- •String, Line, and Word Boundaries
- •The ^ Metacharacter
- •The ^ Metacharacter and Multiline Mode
- •The $ Metacharacter
- •The $ Metacharacter in Multiline Mode
- •Using the ^ and $ Metacharacters Together
- •Matching Blank Lines
- •Working with Dollar Amounts
- •Revisiting the IP Address Example
- •What Is a Word?
- •Identifying Word Boundaries
- •The \< Syntax
- •The \>Syntax
- •The \b Syntax
- •The \B Metacharacter
- •Less-Common Word-Boundary Metacharacters
- •Exercises
- •Grouping Using Parentheses
- •Parentheses and Quantifiers
- •Matching Literal Parentheses
- •U.S. Telephone Number Example
- •Alternation
- •Choosing among Multiple Options
- •Unexpected Alternation Behavior
- •Capturing Parentheses
- •Numbering of Captured Groups
- •Numbering When Using Nested Parentheses
- •Named Groups
- •Non-Capturing Parentheses
- •Back References
- •Exercises
- •Why You Need Lookahead and Lookbehind
- •The (? metacharacters
- •Lookahead
- •Positive Lookahead
- •Negative Lookahead
- •Positive Lookahead Examples
- •Positive Lookahead in the Same Document
- •Inserting an Apostrophe
- •Lookbehind
- •Positive Lookbehind
- •Negative Lookbehind
- •How to Match Positions
- •Adding Commas to Large Numbers
- •Exercises
- •What Are Sensitivity and Specificity?
- •Extreme Sensitivity, Awful Specificity
- •Email Addresses Example
- •Replacing Hyphens Example
- •The Sensitivity/Specificity Trade-Off
- •Sensitivity, Specificity, and Positional Characters
- •Sensitivity, Specificity, and Modes
- •Sensitivity, Specificity, and Lookahead and Lookbehind
- •How Much Should the Regular Expressions Do?
- •Abbreviations
- •Characters from Other Languages
- •Names
- •Sensitivity and How to Achieve It
- •Specificity and How to Maximize It
- •Exercises
- •Documenting Regular Expressions
- •Document the Problem Definition
- •Add Comments to Your Code
- •Making Use of Extended Mode
- •Know Your Data
- •Abbreviations
- •Proper Names
- •Incorrect Spelling
- •Creating Test Cases
- •Debugging Regular Expressions
- •Treacherous Whitespace
- •Backslashes Causing Problems
- •Considering Other Causes
- •The User Interface
- •Metacharacters Available
- •Quantifiers
- •The @ Quantifier
- •The {n,m} Syntax
- •Modes
- •Character Classes
- •Back References
- •Lookahead and Lookbehind
- •Lazy Matching versus Greedy Matching
- •Examples
- •Character Class Examples, Including Ranges
- •Whole Word Searches
- •Search-and-Replace Examples
- •Changing Name Structure Using Back References
- •Manipulating Dates
- •The Star Training Company Example
- •Regular Expressions in Visual Basic for Applications
- •Exercises
- •The User Interface
- •Metacharacters Available
- •Quantifiers
- •Modes
- •Character Classes
- •Alternation
- •Back References
- •Lookahead and Lookbehind
- •Search Example
- •Search-and-Replace Example
- •Online Chats
- •POSIX Character Classes
- •Matching Numeric Digits
- •Exercises
- •Introducing findstr
- •Finding Literal Text
- •Quantifiers
- •Character Classes
- •Command-Line Switch Examples
- •The /v Switch
- •The /a Switch
- •Single File Examples
- •Simple Character Class Example
- •Find Protocols Example
- •Multiple File Example
- •A Filelist Example
- •Exercises
- •The PowerGREP Interface
- •A Simple Find Example
- •The Replace Tab
- •The File Finder Tab
- •Syntax Coloring
- •Other Tabs
- •Numeric Digits and Alphabetic Characters
- •Quantifiers
- •Back References
- •Alternation
- •Line Position Metacharacters
- •Word-Boundary Metacharacters
- •Lookahead and Lookbehind
- •Longer Examples
- •Finding HTML Horizontal Rule Elements
- •Matching Time Example
- •Exercises
- •The Excel Find Interface
- •Escaping Wildcard Characters
- •Using Wildcards in Data Forms
- •Using Wildcards in Filters
- •Exercises
- •Using LIKE with Regular Expressions
- •The % Metacharacter
- •The _ Metacharacter
- •Character Classes
- •Negated Character Classes
- •Using Full-Text Search
- •Using The CONTAINS Predicate
- •Document Filters on Image Columns
- •Exercises
- •Using the _ and % Metacharacters
- •Testing Matching of Literals: _ and % Metacharacters
- •Using Positional Metacharacters
- •Using Character Classes
- •Quantifiers
- •Social Security Number Example
- •Exercises
- •The Interface to Metacharacters in Microsoft Access
- •Creating a Hard-Wired Query
- •Creating a Parameter Query
- •Using the ? Metacharacter
- •Using the * Metacharacter
- •Using the # Metacharacter
- •Using the # Character with Date/Time Data
- •Using Character Classes in Access
- •Exercises
- •The RegExp Object
- •Attributes of the RegExp Object
- •The Other Properties of the RegExp Object
- •The test() Method of the RegExp Object
- •The exec() Method of the RegExp Object
- •The String Object
- •Metacharacters in JavaScript and JScript
- •SSN Validation Example
- •Exercises
- •The RegExp Object and How to Use It
- •Quantifiers
- •Positional Metacharacters
- •Character Classes
- •Word Boundaries
- •Lookahead
- •Grouping and Nongrouping Parentheses
- •Exercises
- •The System.Text.RegularExpressions namespace
- •A Simple Visual Basic .NET Example
- •The Classes of System.Text.RegularExpressions
- •The Regex Object
- •Using the Match Object and Matches Collection
- •Using the Match.Success Property and Match.NextMatch Method
- •The GroupCollection and Group Classes
- •The CaptureCollection and Capture Class
- •The RegexOptions Enumeration
- •Case-Insensitive Matching: The IgnoreCase Option
- •Multiline Matching: The Effect on the ^ and $ Metacharacters
- •Right to Left Matching: The RightToLeft Option
- •Lookahead and Lookbehind
- •Exercises
- •An Introductory Example
- •The Classes of System.Text.RegularExpressions
- •The Regex Class
- •The Options Property of the Regex Class
- •Regex Class Methods
- •The CompileToAssembly() Method
- •The GetGroupNames() Method
- •The GetGroupNumbers() Method
- •GroupNumberFromName() and GroupNameFromNumber() Methods
- •The IsMatch() Method
- •The Match() Method
- •The Matches() Method
- •The Replace() Method
- •The Split() Method
- •Using the Static Methods of the Regex Class
- •The IsMatch() Method as a Static
- •The Match() Method as a Static
- •The Matches() Method as a Static
- •The Replace() Method as a Static
- •The Split() Method as a Static
- •The Match and Matches Classes
- •The Match Class
- •The GroupCollection and Group Classes
- •The RegexOptions Class
- •The IgnorePatternWhitespace Option
- •Metacharacters Supported in Visual C# .NET
- •Using Named Groups
- •Using Back References
- •Exercise
- •The ereg() Set of Functions
- •The ereg() Function
- •The ereg() Function with Three Arguments
- •The eregi() Function
- •The ereg_replace() Function
- •The eregi_replace() Function
- •The split() Function
- •The spliti() Function
- •The sql_regcase() Function
- •Perl Compatible Regular Expressions
- •Pattern Delimiters in PCRE
- •Escaping Pattern Delimiters
- •Matching Modifiers in PCRE
- •Using the preg_match() Function
- •Using the preg_match_all() Function
- •Using the preg_grep() Function
- •Using the preg_quote() Function
- •Using the preg_replace() Function
- •Using the preg_replace_callback() Function
- •Using the preg_split() Function
- •Supported Metacharacters with ereg()
- •Using POSIX Character Classes with PHP
- •Supported Metacharacters with PCRE
- •Positional Metacharacters
- •Character Classes in PHP
- •Documenting PHP Regular Expressions
- •Exercises
- •W3C XML Schema Basics
- •Tools for Using W3C XML Schema
- •Comparing XML Schema and DTDs
- •How Constraints Are Expressed in W3C XML Schema
- •W3C XML Schema Datatypes
- •Derivation by Restriction
- •Unicode and W3C XML Schema
- •Unicode Overview
- •Using Unicode Character Classes
- •Matching Decimal Numbers
- •Mixing Unicode Character Classes with Other Metacharacters
- •Unicode Character Blocks
- •Using Unicode Character Blocks
- •Metacharacters Supported in W3C XML Schema
- •Positional Metacharacters
- •Matching Numeric Digits
- •Alternation
- •Using the \w and \s Metacharacters
- •Escaping Metacharacters
- •Exercises
- •Introduction to the java.util.regex Package
- •Obtaining and Installing Java
- •The Pattern Class
- •Using the matches() Method Statically
- •Two Simple Java Examples
- •The Properties (Fields) of the Pattern Class
- •The CASE_INSENSITIVE Flag
- •Using the COMMENTS Flag
- •The DOTALL Flag
- •The MULTILINE Flag
- •The UNICODE_CASE Flag
- •The UNIX_LINES Flag
- •The Methods of the Pattern Class
- •The compile() Method
- •The flags() Method
- •The matcher() Method
- •The matches() Method
- •The pattern() Method
- •The split() Method
- •The Matcher Class
- •The appendReplacement() Method
- •The appendTail() Method
- •The end() Method
- •The find() Method
- •The group() Method
- •The groupCount() Method
- •The lookingAt() Method
- •The matches() Method
- •The pattern() Method
- •The replaceAll() Method
- •The replaceFirst() Method
- •The reset() Method
- •The start() Method
- •The PatternSyntaxException Class
- •Using the \d Metacharacter
- •Character Classes
- •The POSIX Character Classes in the java.util.regex Package
- •Unicode Character Classes and Character Blocks
- •Using Escaped Characters
- •Using Methods of the String Class
- •Using the matches() Method
- •Using the replaceFirst() Method
- •Using the replaceAll() Method
- •Using the split() Method
- •Exercises
- •Obtaining and Installing Perl
- •Creating a Simple Perl Program
- •Basics of Perl Regular Expression Usage
- •Using the m// Operator
- •Using Other Regular Expression Delimiters
- •Matching Using Variable Substitution
- •Using the s/// Operator
- •Using s/// with the Global Modifier
- •Using s/// with the Default Variable
- •Using the split Operator
- •Using Quantifiers in Perl
- •Using Positional Metacharacters
- •Captured Groups in Perl
- •Using Back References in Perl
- •Using Alternation
- •Using Character Classes in Perl
- •Using Lookahead
- •Using Lookbehind
- •Escaping Metacharacters
- •A Simple Perl Regex Tester
- •Exercises
- •Index
Chapter 9
Figure 9-7
The Sensitivity/Specificity Trade-Off
Sensitivity and specificity are always part of a trade-off. In principle you can aim for 100 percent sensitivity and 100 percent specificity, but the effort required to achieve both may not be practical in some situations. A less exacting level of specificity may be good enough. In the end, only you can judge how much effort is appropriate for the task that you are using regular expressions to achieve.
How important are sensitivity and specificity? The answer is, “It depends.” There are many times when you will need high sensitivity, ideally 100 percent, and at the same time high specificity. At other times, one or the other may be less important. This section looks at some of the factors that influence how much importance you should place on sensitivity and specificity.
It depends to a significant extent on who the customer is. If you are using regular expressions to achieve something for your own use, you may not worry too much if you miss one or two matches. On the other hand, if you are conducting a replacement of every occurrence of a company name after a takeover, for example, it would be serious if sensitivity fell below 100 percent.
How Metacharacters Affect Sensitivity and Specificity
In general, the more metacharacters you use, the more specific a pattern becomes. The pattern cat matches that sequence of characters whether they refer to a feline mammal or form character sequences in words such as cathode and caterpillar.
Sensitivity and Specificity of Regular Expressions
Adding further metacharacters, such as the \b word boundary, makes the use of the character sequence cat in a pattern much more specific. The pattern \bcat\b will match only the word cat (singular).
When using specific patterns like that, you need to watch carefully for the possibility of reducing sensitivity. The pattern \bcat\b will match cat but won’t match cats, for example. If you are interested in finding all references in the document to feline mammals, the \bcat\b pattern may not be the best option. You may want to allow for the occurrence of the plural form, cats, and the possessive form, cat’s, too. The pattern \bcat’?s?’?\b would match cat, cats, cats’ (plural possessive), and cat’s (singular possessive) but would also match cat’, which is unlikely to be a desired match. If your data is unlikely to contain the character sequence cat’, the pattern \bcat’?s?’?\b may be sufficient. But if, for some reason, you want to match only cat, cats, cat’s, and cats’, some other, more specific pattern will be needed. One simple option is as follows:
(cat|cats|cat’s|cats’)
An alternative follows:
ca(t|ts|t’s|ts’)
Similar issues apply whatever the word or sequence of characters of interest.
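The trade-off described above can be tried out in a few lines of Python. This is only a sketch: the sample sentence and the variable names are invented for illustration, and a straight apostrophe (') is assumed in place of the typographic one printed in the text. Because Python's re module tries alternatives from left to right, the longer forms are listed first, and a negative lookahead replaces the trailing \b so that a form ending in an apostrophe is not cut short.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "The cat sat near the cathode while two cats chased the cat's toy."

# Bare literal: very sensitive, but not specific (it also hits "cathode").
loose = re.findall(r"cat", text)

# Word boundaries: more specific, but "cats" is now missed, and the
# match inside "cat's" stops at "cat" because ' is a nonword character.
strict = re.findall(r"\bcat\b", text)

# Optional suffixes, longest alternatives first, followed by a check
# that no further word character continues the word.
forms = re.findall(r"\bcat(?:s'|'s|s)?(?!\w)", text)

print(loose)   # ['cat', 'cat', 'cat', 'cat']
print(strict)  # ['cat', 'cat']
print(forms)   # ['cat', 'cats', "cat's"]
```

Whether the third pattern is worth the extra complexity depends on whether sequences such as cathode or a stray trailing apostrophe actually occur in your data.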
Sensitivity, Specificity, and Positional Characters
The positional characters explored in Chapter 6 can be expected in many cases to affect both sensitivity and specificity.
In the following example, an initial version of the problem definition can be expressed as follows:
Match all occurrences of the sequence of characters t, h, and e case insensitively.
The pattern the will match twice in the following text:
Paris in the the spring.
It will match once in the following text:
The spring has sprung.
However, suppose you modify the problem definition to the following:
Match the position at the beginning of a string; then match the sequence of characters t, h, and e case insensitively.
The pattern ^the now has no match in the first sample text but still has a single match in the second. The effect of adding one or more positional metacharacters depends on the data the pattern is being matched against.
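As a quick check of the two problem definitions, the following Python sketch applies both patterns to the two sample texts. The variable names are illustrative only, and re.IGNORECASE stands in for the case-insensitive matching that the problem definitions call for.

```python
import re

first = "Paris in the the spring."
second = "The spring has sprung."

# "the" anywhere in the text, matched case insensitively.
anywhere = re.compile(r"the", re.IGNORECASE)
# "the" only at the position at the beginning of the string.
at_start = re.compile(r"^the", re.IGNORECASE)

print(len(anywhere.findall(first)))     # 2
print(len(anywhere.findall(second)))    # 1
print(at_start.search(first))           # None
print(at_start.search(second).group())  # The
```

Adding ^ removed both matches in the first text but left the match in the second untouched, which is why the effect of a positional metacharacter can only be judged against the data.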
Sensitivity, Specificity, and Modes
When you specify that a regular expression is to be executed in a case-insensitive or case-sensitive mode, you affect the matches that will be returned. Continuing with the preceding example, the pattern ^the applied case sensitively has no match in either of the two test pieces of text. In the second sample text, the ^ metacharacter matches the position at the beginning of the string, but the lowercase t of the pattern does not match the uppercase T of the test text.
Similarly, a mode setting determines whether the period metacharacter, which otherwise matches a wide range of characters, also matches a newline character.
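The effect of both modes can be seen in a short Python sketch. The flag names re.IGNORECASE and re.DOTALL are those of Python's re module; other languages and tools name these modes differently. The two-line sample string is invented for illustration.

```python
import re

second = "The spring has sprung."

# Case-sensitive by default: lowercase t does not match uppercase T.
case_sensitive = re.search(r"^the", second)
# Case-insensitive mode restores the match.
case_insensitive = re.search(r"^the", second, re.IGNORECASE)

# Hypothetical two-line text to show the period and the newline.
two_lines = "first line\nsecond line"

# By default the period does not match a newline, so .* stops at line end.
dot_default = re.search(r"first.*second", two_lines)
# In DOTALL mode the period also matches the newline character.
dot_all = re.search(r"first.*second", two_lines, re.DOTALL)

print(case_sensitive)            # None
print(case_insensitive.group())  # The
print(dot_default)               # None
print(dot_all is not None)       # True
```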
Sensitivity, Specificity, and Lookahead and Lookbehind
When you add lookahead or lookbehind to an existing regular expression, you may have no effect on sensitivity, or you may adversely impact it. Equally, you may improve specificity or, less likely, it may stay the same.
If a lookbehind is carefully crafted, it won’t reduce sensitivity. However, if you make an error in the pattern inside the lookbehind, you will fail to match when you intended to match, reducing sensitivity. Suppose that you wanted to find information about Anne Smith. The following pattern would match when the spelling of Anne is correct, and it is followed by exactly one space character:
(?<=Anne )Smith
However, if Anne is spelled as Ann somewhere in the document, you will miss an intended match, because the pattern (?<=Anne )Smith does not match at that point.
Equally, if the person’s name were written as A. Smith somewhere in the data, there would be no match. More detailed understanding of the data would be needed to know whether a match was intended or not. The character sequence A. Smith might refer to the person of interest, Anne Smith, but alternatively might refer to Adam Smith or some other person.
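The following Python sketch illustrates how the lookbehind affects sensitivity. The sample text is invented for illustration. Note one assumption specific to Python: its re module requires each lookbehind to have a fixed width, so the looser variant combines two separate lookbehinds rather than writing something like Anne? inside a single one. Other regular expression engines may be more or less permissive.

```python
import re

# Hypothetical sample text containing several ways of writing the name.
text = "Anne Smith chaired. Ann Smith spoke. A. Smith wrote. Anne  Smith left."

# Strict lookbehind: only "Anne" followed by exactly one space qualifies.
strict = re.findall(r"(?<=Anne )Smith", text)

# Looser variant: accept either spelling, each in its own fixed-width
# lookbehind. This recovers the misspelled "Ann Smith".
looser = re.findall(r"(?:(?<=Anne )|(?<=Ann ))Smith", text)

print(len(strict))  # 1
print(len(looser))  # 2
```

Neither pattern matches A. Smith or the occurrence with two spaces; whether those should match is a question about the data, not about the syntax.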
Similarly, lookahead can reduce sensitivity. For example, suppose that you want to match all occurrences of the character sequence John. The following pattern would match a word boundary, then the desired character sequence John, and then check if the following character is a space character:
\bJohn(?= )
However, if the test text is as follows, the lookahead is too specific and causes what is likely to be a desired match to fail:
I went with John, and Mary on a trip.
Modifying the lookahead to (?=\b) or (?=\W) would prevent the problem caused by the occurrence of an unanticipated comma.
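A short Python sketch makes the difference between the three lookaheads visible. The second test string, which ends with the name, is an added assumption used to show one further edge case: (?=\W) needs an actual nonword character to follow, so it fails at the very end of a string, whereas (?=\b) succeeds there.

```python
import re

text = "I went with John, and Mary on a trip."
at_end = "I went with John"  # hypothetical text ending with the name

# Lookahead for a space: too specific, the comma defeats it.
space_only = re.search(r"\bJohn(?= )", text)
# Lookahead for a word boundary or a nonword character: both match.
boundary = re.search(r"\bJohn(?=\b)", text)
nonword = re.search(r"\bJohn(?=\W)", text)

# At the end of the string there is no character for \W to match.
nonword_at_end = re.search(r"\bJohn(?=\W)", at_end)
boundary_at_end = re.search(r"\bJohn(?=\b)", at_end)

print(space_only)               # None
print(boundary.group())         # John
print(nonword.group())          # John
print(nonword_at_end)           # None
print(boundary_at_end.group())  # John
```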
How Much Should the Regular Expressions Do?
Most of the examples earlier in this book use a range of tools with regular expression functionality to apply regular expressions. That’s great when teaching regular expressions, but when you use regular expressions as a developer, you will typically be using regular expressions inside code written in Java, JavaScript, VB.NET, and so on, or you may be applying regular expressions to data retrieved from a relational database. So how much should you expect the regular expressions to do, and how much can you safely assume that your other code or the error checking in a database already does?
For example, suppose you have a collection of HTML documents that include IP addresses, and your task is to amend the style in which the IP addresses are displayed. Suppose that initially, IP addresses are nested inside the start and end tags for HTML b elements, as in the following:
<b>1.12.123.234</b>
What pattern should you use to find such IP addresses? Should you just assume that the data you receive will be correctly formed (including having no values of 256 or more), or should you include a more complex pattern so that the regular expression will match only correctly formed IP addresses?
If you assume that the IP addresses are already correctly formed or are checked by some other part of your code, you could use a fairly simple pattern such as the following:
<b>([0-9]+(\.[0-9]+){3})</b>
This would match character sequences that are not IP addresses, such as the following:
<b>1234.2345.5678.9999999</b>
If you can be sure that your data doesn’t include undesired values such as the preceding one, the simple pattern shown might be enough. Without much work, you can adapt the pattern so that only between one and three numeric digits can be included before or after a period character:
<b>([0-9]{1,3}(\.[0-9]{1,3}){3})</b>
Inappropriate character sequences such as the following would still be matched, but at least you improve the specificity a little by excluding false matches that contain runs of more than three numeric digits, such as the one shown earlier:
<b>999.256.789.1</b>
If, however, you can’t be sure that the supposed IP addresses are correctly formatted, you may need to develop a longer, more complex pattern. On the other hand, it may not matter for a particular purpose whether the supposed IP addresses are valid IP addresses or not. If that is the situation, the simplest regular expression is likely to be an appropriate option to use.
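One way to divide the work between the regular expression and the surrounding code is sketched below in Python. It is an illustration under stated assumptions, not a recommended implementation: the helper names octets_in_range and find_addresses are hypothetical, the inner group is written as non-capturing so that findall returns only the address, and the range check is done in ordinary code rather than in a longer pattern.

```python
import re

# Simple pattern: any run of digits in each of the four positions.
SIMPLE = re.compile(r"<b>([0-9]+(?:\.[0-9]+){3})</b>")
# Bounded pattern: between one and three digits in each position.
BOUNDED = re.compile(r"<b>([0-9]{1,3}(?:\.[0-9]{1,3}){3})</b>")


def octets_in_range(address):
    """Return True if every dot-separated part is between 0 and 255."""
    return all(0 <= int(part) <= 255 for part in address.split("."))


def find_addresses(html, pattern, check_range=False):
    """Return the candidate addresses, optionally filtered by range."""
    found = pattern.findall(html)
    if check_range:
        found = [address for address in found if octets_in_range(address)]
    return found


html = ("<b>1.12.123.234</b> "
        "<b>1234.2345.5678.9999999</b> "
        "<b>999.256.789.1</b>")

print(find_addresses(html, SIMPLE))
# ['1.12.123.234', '1234.2345.5678.9999999', '999.256.789.1']
print(find_addresses(html, BOUNDED))
# ['1.12.123.234', '999.256.789.1']
print(find_addresses(html, BOUNDED, check_range=True))
# ['1.12.123.234']
```

Keeping the pattern simple and leaving the numeric check to the code is one possible choice; if the regular expression is the only tool available, a longer pattern would have to carry the whole burden.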
Knowing the Data, Sensitivity, and Specificity
One of the key issues that affect how well you achieve sensitivity and specificity is how well you understand the data to which you are applying regular expressions. Of course, your understanding of the regular expression syntax and techniques supported by your chosen language or tool is important, too.