- Introduction
- Who This Book Is For
- What This Book Covers
- How This Book Is Structured
- What You Need to Use This Book
- Conventions
- Source Code
- Errata
- p2p.wrox.com
- What Are Regular Expressions?
- What Can Regular Expressions Be Used For?
- Finding Doubled Words
- Checking Input from Web Forms
- Changing Date Formats
- Finding Incorrect Case
- Adding Links to URLs
- Regular Expressions You Already Use
- Search and Replace in Word Processors
- Directory Listings
- Online Searching
- Why Regular Expressions Seem Intimidating
- Compact, Cryptic Syntax
- Whitespace Can Significantly Alter the Meaning
- No Standards Body
- Differences between Implementations
- Characters Change Meaning in Different Contexts
- Regular Expressions Can Be Case Sensitive
- Case-Sensitive and Case-Insensitive Matching
- Case and Metacharacters
- Continual Evolution in Techniques Supported
- Multiple Solutions for a Single Problem
- What You Want to Do with a Regular Expression
- Replacing Text in Quantity
- Regular Expression Tools
- findstr
- Microsoft Word
- StarOffice Writer/OpenOffice.org Writer
- Komodo Rx Package
- PowerGrep
- Microsoft Excel
- JavaScript and JScript
- VBScript
- Visual Basic.NET
- Java
- Perl
- MySQL
- SQL Server 2000
- W3C XML Schema
- An Analytical Approach to Using Regular Expressions
- Express and Document What You Want to Do in English
- Consider the Regular Expression Options Available
- Consider Sensitivity and Specificity
- Create Appropriate Regular Expressions
- Document All but Simple Regular Expressions
- Document What You Expect the Regular Expression to Do
- Document What You Want to Match
- Test the Results of a Regular Expression
- Matching Single Characters
- Matching Sequences of Characters That Each Occur Once
- Introducing Metacharacters
- Matching Sequences of Different Characters
- Matching Optional Characters
- Matching Multiple Optional Characters
- Other Cardinality Operators
- The * Quantifier
- The + Quantifier
- The Curly-Brace Syntax
- The {n} Syntax
- The {n,m} Syntax
- Exercises
- Regular Expression Metacharacters
- Thinking about Characters and Positions
- The Period (.) Metacharacter
- Matching Variably Structured Part Numbers
- Matching a Literal Period
- The \w Metacharacter
- The \W Metacharacter
- Digits and Nondigits
- The \d Metacharacter
- Canadian Postal Code Example
- The \D Metacharacter
- Alternatives to \d and \D
- The \s Metacharacter
- Handling Optional Whitespace
- The \S Metacharacter
- The \t Metacharacter
- The \n Metacharacter
- Escaped Characters
- Finding the Backslash
- Modifiers
- Global Search
- Case-Insensitive Search
- Exercises
- Introduction to Character Classes
- Choice between Two Characters
- Using Quantifiers with Character Classes
- Using the \b Metacharacter in Character Classes
- Selecting Literal Square Brackets
- Using Ranges in Character Classes
- Alphabetic Ranges
- Use [A-z] With Care
- Digit Ranges in Character Classes
- Hexadecimal Numbers
- IP Addresses
- Reverse Ranges in Character Classes
- A Potential Range Trap
- Finding HTML Heading Elements
- Metacharacter Meaning within Character Classes
- The ^ metacharacter
- How to Use the - Metacharacter
- Negated Character Classes
- Combining Positive and Negative Character Classes
- POSIX Character Classes
- The [:alnum:] Character Class
- Exercises
- String, Line, and Word Boundaries
- The ^ Metacharacter
- The ^ Metacharacter and Multiline Mode
- The $ Metacharacter
- The $ Metacharacter in Multiline Mode
- Using the ^ and $ Metacharacters Together
- Matching Blank Lines
- Working with Dollar Amounts
- Revisiting the IP Address Example
- What Is a Word?
- Identifying Word Boundaries
- The \< Syntax
- The \> Syntax
- The \b Syntax
- The \B Metacharacter
- Less-Common Word-Boundary Metacharacters
- Exercises
- Grouping Using Parentheses
- Parentheses and Quantifiers
- Matching Literal Parentheses
- U.S. Telephone Number Example
- Alternation
- Choosing among Multiple Options
- Unexpected Alternation Behavior
- Capturing Parentheses
- Numbering of Captured Groups
- Numbering When Using Nested Parentheses
- Named Groups
- Non-Capturing Parentheses
- Back References
- Exercises
- Why You Need Lookahead and Lookbehind
- The (? metacharacters
- Lookahead
- Positive Lookahead
- Negative Lookahead
- Positive Lookahead Examples
- Positive Lookahead in the Same Document
- Inserting an Apostrophe
- Lookbehind
- Positive Lookbehind
- Negative Lookbehind
- How to Match Positions
- Adding Commas to Large Numbers
- Exercises
- What Are Sensitivity and Specificity?
- Extreme Sensitivity, Awful Specificity
- Email Addresses Example
- Replacing Hyphens Example
- The Sensitivity/Specificity Trade-Off
- Sensitivity, Specificity, and Positional Characters
- Sensitivity, Specificity, and Modes
- Sensitivity, Specificity, and Lookahead and Lookbehind
- How Much Should the Regular Expressions Do?
- Abbreviations
- Characters from Other Languages
- Names
- Sensitivity and How to Achieve It
- Specificity and How to Maximize It
- Exercises
- Documenting Regular Expressions
- Document the Problem Definition
- Add Comments to Your Code
- Making Use of Extended Mode
- Know Your Data
- Abbreviations
- Proper Names
- Incorrect Spelling
- Creating Test Cases
- Debugging Regular Expressions
- Treacherous Whitespace
- Backslashes Causing Problems
- Considering Other Causes
- The User Interface
- Metacharacters Available
- Quantifiers
- The @ Quantifier
- The {n,m} Syntax
- Modes
- Character Classes
- Back References
- Lookahead and Lookbehind
- Lazy Matching versus Greedy Matching
- Examples
- Character Class Examples, Including Ranges
- Whole Word Searches
- Search-and-Replace Examples
- Changing Name Structure Using Back References
- Manipulating Dates
- The Star Training Company Example
- Regular Expressions in Visual Basic for Applications
- Exercises
- The User Interface
- Metacharacters Available
- Quantifiers
- Modes
- Character Classes
- Alternation
- Back References
- Lookahead and Lookbehind
- Search Example
- Search-and-Replace Example
- Online Chats
- POSIX Character Classes
- Matching Numeric Digits
- Exercises
- Introducing findstr
- Finding Literal Text
- Quantifiers
- Character Classes
- Command-Line Switch Examples
- The /v Switch
- The /a Switch
- Single File Examples
- Simple Character Class Example
- Find Protocols Example
- Multiple File Example
- A Filelist Example
- Exercises
- The PowerGREP Interface
- A Simple Find Example
- The Replace Tab
- The File Finder Tab
- Syntax Coloring
- Other Tabs
- Numeric Digits and Alphabetic Characters
- Quantifiers
- Back References
- Alternation
- Line Position Metacharacters
- Word-Boundary Metacharacters
- Lookahead and Lookbehind
- Longer Examples
- Finding HTML Horizontal Rule Elements
- Matching Time Example
- Exercises
- The Excel Find Interface
- Escaping Wildcard Characters
- Using Wildcards in Data Forms
- Using Wildcards in Filters
- Exercises
- Using LIKE with Regular Expressions
- The % Metacharacter
- The _ Metacharacter
- Character Classes
- Negated Character Classes
- Using Full-Text Search
- Using The CONTAINS Predicate
- Document Filters on Image Columns
- Exercises
- Using the _ and % Metacharacters
- Testing Matching of Literals: _ and % Metacharacters
- Using Positional Metacharacters
- Using Character Classes
- Quantifiers
- Social Security Number Example
- Exercises
- The Interface to Metacharacters in Microsoft Access
- Creating a Hard-Wired Query
- Creating a Parameter Query
- Using the ? Metacharacter
- Using the * Metacharacter
- Using the # Metacharacter
- Using the # Character with Date/Time Data
- Using Character Classes in Access
- Exercises
- The RegExp Object
- Attributes of the RegExp Object
- The Other Properties of the RegExp Object
- The test() Method of the RegExp Object
- The exec() Method of the RegExp Object
- The String Object
- Metacharacters in JavaScript and JScript
- SSN Validation Example
- Exercises
- The RegExp Object and How to Use It
- Quantifiers
- Positional Metacharacters
- Character Classes
- Word Boundaries
- Lookahead
- Grouping and Nongrouping Parentheses
- Exercises
- The System.Text.RegularExpressions namespace
- A Simple Visual Basic .NET Example
- The Classes of System.Text.RegularExpressions
- The Regex Object
- Using the Match Object and Matches Collection
- Using the Match.Success Property and Match.NextMatch Method
- The GroupCollection and Group Classes
- The CaptureCollection and Capture Class
- The RegexOptions Enumeration
- Case-Insensitive Matching: The IgnoreCase Option
- Multiline Matching: The Effect on the ^ and $ Metacharacters
- Right to Left Matching: The RightToLeft Option
- Lookahead and Lookbehind
- Exercises
- An Introductory Example
- The Classes of System.Text.RegularExpressions
- The Regex Class
- The Options Property of the Regex Class
- Regex Class Methods
- The CompileToAssembly() Method
- The GetGroupNames() Method
- The GetGroupNumbers() Method
- GroupNumberFromName() and GroupNameFromNumber() Methods
- The IsMatch() Method
- The Match() Method
- The Matches() Method
- The Replace() Method
- The Split() Method
- Using the Static Methods of the Regex Class
- The IsMatch() Method as a Static
- The Match() Method as a Static
- The Matches() Method as a Static
- The Replace() Method as a Static
- The Split() Method as a Static
- The Match and Matches Classes
- The Match Class
- The GroupCollection and Group Classes
- The RegexOptions Class
- The IgnorePatternWhitespace Option
- Metacharacters Supported in Visual C# .NET
- Using Named Groups
- Using Back References
- Exercise
- The ereg() Set of Functions
- The ereg() Function
- The ereg() Function with Three Arguments
- The eregi() Function
- The ereg_replace() Function
- The eregi_replace() Function
- The split() Function
- The spliti() Function
- The sql_regcase() Function
- Perl Compatible Regular Expressions
- Pattern Delimiters in PCRE
- Escaping Pattern Delimiters
- Matching Modifiers in PCRE
- Using the preg_match() Function
- Using the preg_match_all() Function
- Using the preg_grep() Function
- Using the preg_quote() Function
- Using the preg_replace() Function
- Using the preg_replace_callback() Function
- Using the preg_split() Function
- Supported Metacharacters with ereg()
- Using POSIX Character Classes with PHP
- Supported Metacharacters with PCRE
- Positional Metacharacters
- Character Classes in PHP
- Documenting PHP Regular Expressions
- Exercises
- W3C XML Schema Basics
- Tools for Using W3C XML Schema
- Comparing XML Schema and DTDs
- How Constraints Are Expressed in W3C XML Schema
- W3C XML Schema Datatypes
- Derivation by Restriction
- Unicode and W3C XML Schema
- Unicode Overview
- Using Unicode Character Classes
- Matching Decimal Numbers
- Mixing Unicode Character Classes with Other Metacharacters
- Unicode Character Blocks
- Using Unicode Character Blocks
- Metacharacters Supported in W3C XML Schema
- Positional Metacharacters
- Matching Numeric Digits
- Alternation
- Using the \w and \s Metacharacters
- Escaping Metacharacters
- Exercises
- Introduction to the java.util.regex Package
- Obtaining and Installing Java
- The Pattern Class
- Using the matches() Method Statically
- Two Simple Java Examples
- The Properties (Fields) of the Pattern Class
- The CASE_INSENSITIVE Flag
- Using the COMMENTS Flag
- The DOTALL Flag
- The MULTILINE Flag
- The UNICODE_CASE Flag
- The UNIX_LINES Flag
- The Methods of the Pattern Class
- The compile() Method
- The flags() Method
- The matcher() Method
- The matches() Method
- The pattern() Method
- The split() Method
- The Matcher Class
- The appendReplacement() Method
- The appendTail() Method
- The end() Method
- The find() Method
- The group() Method
- The groupCount() Method
- The lookingAt() Method
- The matches() Method
- The pattern() Method
- The replaceAll() Method
- The replaceFirst() Method
- The reset() Method
- The start() Method
- The PatternSyntaxException Class
- Using the \d Metacharacter
- Character Classes
- The POSIX Character Classes in the java.util.regex Package
- Unicode Character Classes and Character Blocks
- Using Escaped Characters
- Using Methods of the String Class
- Using the matches() Method
- Using the replaceFirst() Method
- Using the replaceAll() Method
- Using the split() Method
- Exercises
- Obtaining and Installing Perl
- Creating a Simple Perl Program
- Basics of Perl Regular Expression Usage
- Using the m// Operator
- Using Other Regular Expression Delimiters
- Matching Using Variable Substitution
- Using the s/// Operator
- Using s/// with the Global Modifier
- Using s/// with the Default Variable
- Using the split Operator
- Using Quantifiers in Perl
- Using Positional Metacharacters
- Captured Groups in Perl
- Using Back References in Perl
- Using Alternation
- Using Character Classes in Perl
- Using Lookahead
- Using Lookbehind
- Escaping Metacharacters
- A Simple Perl Regex Tester
- Exercises
- Index
C# and Regular Expressions
The regular expression to be matched is the second argument of the Replace() method, (\w+)\s+(\1). That pattern matches a sequence of one or more word characters (for ASCII text, \w is equivalent to the character class [A-Za-z0-9_]), followed by one or more whitespace characters and then, as indicated by the \1 back reference, the same sequence of word characters that has already been matched. In other words, the pattern matches a doubled word separated by whitespace.
The third argument of the Replace() method is the pattern to be used to replace any matched text. The matched text contains the doubled word (if one exists). The replacement text uses the numbered group corresponding to the back reference, ${1}, to replace two occurrences of the word with one:
string outputString = Regex.Replace(inputString, @"(\w+)\s+(\1)",
    "${1}");
Then the original string and the changed string are displayed to the user:
Console.WriteLine("You entered the string: '" + inputString + "'.");
Console.WriteLine("The replaced string is '" + outputString + "'.");
Console.ReadLine();
Exercise
1. Which of the RegexOptions is used to specify case-insensitive matching?
Chapter 23: PHP and Regular Expressions
PHP (a recursive acronym for PHP: Hypertext Preprocessor) is a widely used language for Web-based applications. One common task in Web-based applications, whatever language is used, is the validation of user input either on the client side or on the server side before data is written to a relational database.
PHP is typically used on the server side and has similarities to ASP and ASP.NET. To work through the examples in this chapter, you will need to install PHP on a Web server.
In this chapter, you will learn the following:
- How to get started with PHP 5.0
- How PHP structures support for regular expressions
- How to use the ereg() family of functions
- What metacharacters PHP supports in Perl Compatible Regular Expressions (PCRE)
- How to match commonly needed user entries
This chapter describes the regular expression functionality in PHP version 5.0.
Getting Started with PHP 5.0
To run the examples shown in this chapter, you must install PHP on a Web server. Because this book focuses on using regular expressions on Windows platforms, this section covers installing PHP on a Windows IIS server.
With the advent of PHP 5.0, the recommended methods of installing PHP have changed significantly from those previously recommended on www.php.net.
The PHP Web site at www.php.net is the official source of up-to-date information about PHP. This chapter focuses on PHP 5.0 functionality, but it is possible that recommendations on installation and/or configuration will change. I suggest that you check the URL given for the current situation.
If you need PHP 4 rather than PHP 5, that can still be downloaded from www.php.net/downloads.php at the time of this writing. If, for compatibility reasons, you need PHP 3, it can be downloaded from http://museum.php.net/.
The following instructions describe how to install PHP 5.0.1 using the Windows installer package. It is assumed that you have already installed IIS. The Windows installer package is the easiest way to install PHP on Windows, but it has limitations. The PHP installation files are also available as a .zip file, which has to be installed manually but does allow full control over how PHP is installed. Because the focus of this chapter is the use of regular expressions with PHP, rather than on a detailed consideration of PHP installation on a Web server, no information on installation and configuration of the .zip file download is provided here.
The following instructions should get you up and running. But be aware that they take no account of how to create a secure PHP installation. If you want to use PHP on a production server, be sure to invest time in fully understanding the security issues relating to the use of PHP on the Internet.
Try It Out: Installing PHP Using the Windows Installer
1. Download the Windows installer from the download page on www.php.net (at the time of this writing, downloads were listed on www.php.net/downloads.php), and double-click the Windows installer package. Figure 23-1 shows the initial screen of the installer package for PHP 5.0.1.
Figure 23-1
2. Click the Next button. Read the License Agreement, and click the I Agree button. If you don’t accept the license, you won’t be able to use the installer to install PHP 5.0.1.
3. On the next screen, you are offered a choice between Standard and Advanced installation. Select Advanced, and click the Next button.
4. Choose a location for installation. I chose C:\PHP 5.0.1. Click the Next button.
5. You are then asked if you want to create backups of any file replaced during installation. Leave the default option, Yes, and click the Next button.
6. Accept the default upload directory, and click the Next button. Accept the default directory for session information, and click the Next button.
7. Accept localhost as your SMTP server location or modify it as appropriate. For the purposes of this test installation, I suggest that you accept localhost; then click the Next button.
8. Accept the default option about warnings and errors, and click the Next button.
9. On the next screen, the installer is likely to recognize the version of IIS or Personal Web Server (PWS) you have installed. Unless you have good reason to do otherwise, accept the default, and click the Next button.
10. Select the file extensions to be associated with PHP. I suggest that you restrict this to .php unless you have a specific need to do otherwise. Click the Next button.
11. On the next screen, you are informed that the installer has the needed information to carry out the installation. Click the Next button. The installer will display messages about progress of the installation. If all has gone well, you should see the message shown in Figure 23-2.
Figure 23-2
Now that PHP appears to have been installed successfully, you need to test whether it works correctly with IIS. The following instructions assume that IIS is installed, that it is running on the local machine, and that the default directory for IIS content is C:\inetpub\wwwroot\. If you have a different setup, amend the following instructions accordingly.
12. In Notepad or some other text editor, type the following code:
<?php
// Display the configuration of the current PHP installation.
phpinfo();
?>
13. Create a subdirectory PHP in C:\inetpub\wwwroot\ or an alternative location, if you prefer. That will allow you to access PHP code using the URL http://localhost/PHP/ plus the relevant PHP filename.
14. Save the file as phpinfo.php in the PHP directory. If you are saving from Notepad, be sure to enclose the filename in paired double quotes, or Notepad will save the file as phpinfo.php.txt, which won’t run correctly when accessed from a Web browser.
15. Open Internet Explorer or an alternative browser, and type the URL http://localhost/PHP/phpinfo.php into the browser.
Figure 23-3 shows the result you should expect to see after this step. Naturally, if you are not using Internet Explorer 6.0 or PHP 5.0.1, the appearance will differ a little from that shown. However, if you see a Web page similar to that in Figure 23-3, you have a successful install of PHP.
Figure 23-3
16. Use the Ctrl+F keyboard shortcut to search for the text PCRE in the Web page. As shown in Figure 23-4, the PCRE functionality is enabled by default in PHP 5.0.1 installed using the Windows installer option. Because some of the examples in this chapter depend on the presence of PCRE functionality, it is important that you verify that it is enabled.
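The check in step 16 can also be made from a script instead of by searching the phpinfo() page. The following is a minimal sketch and is not one of the chapter's own examples: the function name pcre_status() is hypothetical, and the only built-in it relies on is the standard PHP function extension_loaded(). Saved as a .php file in the PHP directory created earlier, it should display a one-line status message in the browser.

```php
<?php
// Hypothetical helper (not from the chapter): returns a short message
// stating whether the PCRE extension is loaded in this PHP installation.
function pcre_status()
{
    // extension_loaded() returns true when the named extension is available.
    if (extension_loaded('pcre')) {
        return 'PCRE is enabled.';
    }
    return 'PCRE is not enabled.';
}

// Print the status so that it appears in the browser.
echo pcre_status();
?>
```

If the script reports that PCRE is not enabled, the preg_ functions that PCRE provides will not be available, so the PCRE examples will not run until the installation is corrected.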
Now that you know your PHP installation is working, you can move on to take a closer look at the ways in which regular expression functionality is supported in PHP 5.0.