Добавил:

fench Опубликованный материал нарушает ваши авторские права? Сообщите нам.

Вуз:

Сумский государственный университет

Предмет:

Программирование

Файл:

Beginning Regular Expressions 2005.pdf

Скачиваний:

Добавлен:

17.08.2013

Размер:

25.42 Mб

Скачать

☆

<<< < Предыдущая 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167168 / 169168 169 > Следующая >>>

Regular Expressions in Perl

A Simple Perl Regex Tester

You have seen a range of techniques used to explore some of the ways Perl regular expressions can be used. You may find it useful to have a simple Perl tool to test regular expressions against test strings. RegexTester.pl is intended to provide you with straightforward functionality to do that.

The code for RegexTester.pl is shown here (the file is available in the code download):

#!/usr/bin/perl -w use strict;

print “This is a simple Regular Expression Tester.\n”; print “First, enter the pattern you want to test.\n”;

print “Remember NOT to escape metacharacters like \\d with an extra \\ when you supply a pattern on the command line.\n”;

print “Enter your pattern here: “; my $myPattern = <STDIN>; chomp($myPattern);

print “The pattern being tested is ‘$myPattern’.”; print “Enter a test string:\n”;

while (<>)

{

chomp();

if (/$myPattern/)

{

print “Matched ‘$&’ in ‘$_’\n”;

print “\nEnter another test string (or Ctrl+C to terminate):”;

}

else

{

print “No match was found for ‘$myPattern’ in ‘$_’.\n”;

print “\nEnter another test string (or Ctrl+C to terminate):”;

}

Try It Out Using the Simple Perl Regex Tester

1.Run RegexTester.pl from the command line, using the command perl RegexTester.pl.

2.Enter the pattern \d{5}-\d{4}, which matches an extended U.S. Zip code but does not match the abbreviated Zip code form.

3.Enter the test string 12345-6789, and inspect the displayed result.

4.Enter the test string 12345, and inspect the result, as shown in Figure 26-33.

Figure 26-33

703

Chapter 26

How It Works

First, some straightforward information is displayed to remind the user what the program does:

print “This is a simple Regular Expression Tester.\n”;

print “First, enter the pattern you want to test.\n”;

Paradoxically, in the message that tells the user not to escape metacharacters, such as \d, you have to escape the \d to get it to display correctly. The same applies to displaying the backslash character:

print “Remember NOT to escape metacharacters like \\d with an extra \\ when you

supply a pattern on the command line.\n”;

Then instruct the user to enter a pattern:

print “Enter your pattern here: “;

Use the <STDIN> to capture the line of input from the user. It contains the pattern that the user specified plus a newline character:

my $myPattern = <STDIN>;

Use chomp() to remove the undesired newline character:

chomp($myPattern);

Tell the user the pattern that was entered, so the user can check for any typos before wasting time matching against a pattern that is different from the pattern he or she thinks is being matched against:

print “The pattern being tested is ‘$myPattern’.”;

Ask the user to enter a test string:

print “Enter a test string:\n”;

Then use <> to indicate to keep looping while there is another line of input from the user:

while (<>)

{

The test string that the user has input has an undesired newline character at the end of it, so you use chomp() to get rid of the newline:

chomp();

Then the test for the if statement tests whether there is a match in the input line (after chomp() has been used) against the pattern $myPattern:

if (/$myPattern/)

{

704

Regular Expressions in Perl

If there is a match, the special variable $& contains it. So you tell the user what character sequence matched the pattern:

print “Matched ‘$&’ in ‘$_’\n”;

Then you invite the user to either input another test string or terminate the program:

print “\nEnter another test string (or Ctrl+C to terminate):”;

}

If no match is found, the statement block for the else clause is executed. The user is informed that there is no match and that he or she has a choice to enter another test string or terminate the program:

else

{

print “No match was found for ‘$myPattern’ in ‘$_’.\n”;

print “\nEnter another test string (or Ctrl+C to terminate):”;

}

Exercises

1.Create a pattern for a 16-digit credit card number, allowing the user the option to split the numeric digits into groups of four. Assume for the purposes of this exercise that all numeric digits are acceptable in all positions where a numeric digit is expected. Use the RegexTester.pl to test the test strings 1234 5678 9012 3456 and 1234567890123456.

2.Modify the example LookBehind.pl so that it matches Training when it is not preceded by the character sequence Star followed by a space character. Make sure that the code you create is working by testing it with the test strings Training is great! and Star Training is great!.

705

Exercise Answers

Chapter 3

1.You can match a doubled s using the pattern ss or s{2}. To match doubled m, use the pattern mm or m{2}.

2.The pattern AB\d\d or AB\d{2} would match the specified text.

3.You need to change only one line in UpperL.html to achieve the desired result. For tidiness you should also change the content of the title element to reflect the changed functionality.

The modified file UpperLmodified.html is shown here. The edited lines are highlighted. The second highlighted line causes the variable myRegExp to attempt to match the pattern the:

<html>

<head>

<title>Check for character sequence ‘the’.</title> <script language=”javascript” type=”text/javascript”> var myRegExp = /the/;

function Validate(entry){ return myRegExp.test(entry); } // end function Validate()

function ShowPrompt(){

var entry = prompt(“This script tests for matches for the regular expression pattern: “ + myRegExp + “.\nType in a string and click on the OK button.”, “Type your text here.”);

if (Validate(entry)){

alert(“There is a match!\nThe regular expression pattern is: “ + myRegExp + “.\n The string that you entered was: ‘“ + entry + “‘.”);

} // end if else{

alert(“There is no match in the string you entered.\n” + “The regular expression pattern is “ + myRegExp + “\n” + “You entered the string: ‘“ + entry + “‘.” );

Appendix A

}// end else

}// end function ShowPrompt()

</script>

</head>

<body>

<button type=”Button” onclick=”ShowPrompt()”>Click here to enter text.</button> </form>

</body>

</html>

Chapter 4

1.The . metacharacter matches all characters except a newline character. The \w metacharacter matches only ASCII alphabetic character (both uppercase and lowercase A through Z).

2.The line with the revised pattern is as follows:

var myRegExp = /<Person DateOfBirth\s*=\s*”.*”\s*>/;

The revised pattern adds the pattern \s* on both sides of the = character.

Chapter 5

1.The problem definition to match license and licence should be something like this:

Match a lowercase l followed by a lowercase i, followed by a lowercase c, followed by a lowercase e, followed by a lowercase n, followed by a choice of either lowercase s or lowercase c, followed by a lowercase e.

Two regular expression patterns are correct. The regular expression pattern needed is either:

licen[cs]e

or:

licen[sc]e

These are identical in meaning.

2.Let’s break the solution for this question into two parts.

For the month, there are two situations: when the first digit is 0 and when the first digit is 1.

When the first digit is 0, you want the second digit to be 1 through 9. You can write a pattern to reflect that as follows:

0[1-9]

When the first digit of the month is 1, the second digit must be 0, 1, or 2. You can use the following patterns to reflect that:

708

Exercise Answers

1[012]

or:

1[0-2]

Because you want either of those choices, you can use parentheses with a bar character, |, between them to specify months 01 through 12:

(0[1-9]|1[012])

For the day part of the problem, you first have the situation where the first digit is 0. In that situation, the second digit must be in the range 1 through 9. A suitable pattern is:

0[1-9]

When the first digit of the day is 1, the second digit must be 0 through 9. A suitable pattern is:

1[0-9]

Similarly, when the first digit of the day is 2, the second digit must be 0 through 9. A suitable pattern is:

2[0-9]

You can combine those two patterns as follows:

[12][0-9]

Finally, when the first digit of the day is 3, the second digit must be 0 or 1. A suitable pattern is:

3[01]

The patterns for the day are mutually exclusive choices, so you can combine them using the bar, |, character as follows:

(0[1-9]|[12][0-9]|3[01])

The whole pattern then becomes the following:

(20|19)[0-9]{2}[-./](0[1-9]|1[012])[-./](0[1-9]|[12][0-9]|3[01])

This pattern is an improvement on the previous one, but not perfect. It does not, for example, recognize that there is something wrong with a date 2004/02/30. Because there is no such day as February 30, you might not want to allow that to match. Whether you want to develop the pattern further would depend on how important it is to you to ensure that all spurious dates be identified.

709

Appendix A

Chapter 6

1.The pattern the\> matches occurrences of the sequence of characters the when they occur at the end of a word. The corresponding pattern using the \b metacharacter is the\b.

2.The pattern asked for won’t match a word boundary before the sequence of characters the (because it won’t match the in then); neither will it match a word boundary after the sequence of characters the (because it won’t match the in lathe). Therefore, you need a pattern that matches something that isn’t a word boundary followed by the sequence of characters the, followed by something that isn’t a word boundary.

The problem definition could be expressed as follows:

Match a position that isn’t a word boundary followed by the sequence of characters t, h, and e, followed by a position that isn’t a word boundary.

A suitable pattern is:

\Bthe\B

Figure A-1 shows the solution in the Komodo Regular Expressions Toolkit.

Figure A-1

Chapter 7

1.Three patterns that will match each of the two desired sequences of characters follow:

(licence|license)

or:

licen(c|s)e

710

Exercise Answers

or:

licen[cs]e

In each of the three options shown, reversal of the order of options inside the parentheses also gives correct answers.

2.The following pattern is the only solution:

(Fear|fear) \1

The following pattern captures only the initial f or F, so it doesn’t detect whether the second word is fear or, for example, Fred:

(F|f)ear \1

The following pattern won’t work, because there are no parentheses; nothing is captured. Therefore, the \1 back reference won’t work:

[Ff]ear \1

Chapter 8

1.The alphabetic characters can match the pattern [A-Za-z]+, assuming that both uppercase and lowercase characters are to be matched. To specify that the alphabetic characters must be followed by a comma, simply add a literal comma inside the appropriate lookahead parentheses,

(?=,). The whole pattern is [A-Za-z]+(?=,).

2.The pattern (?<=\W)sheep(?=\W) will match the word sheep. The lookbehind (?<=\W) specifies that the character preceding the s of sheep must not be a word character. The lookahead (?=\W) specifies that the character following the p of sheep must not be a word character. The overall effect is that the word sheep is matched.

The pattern (?<=[^A-Za-z])sheep(?=[^A-Za-z]) would also match the word sheep.

Chapter 9

1.The original pattern was ^\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}$. The necessary modification is straightforward. Instead of using \w{3,4} to match three-character or fourcharacter sequences after the final period character of the e-mail address, you can use alternation to offer the desired choice of specific domains.

That is matched by the pattern (com|net|org). Remember not to include any space characters inside the parentheses, or you will lose matches, because there is no space character in the e-mails in the test data.

The full pattern is:

^\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.(com|net|org)$

2.One approach would be to add further alternative specific lookaheads. The following pattern will match the additional characters that occur in the test text:

Star((?= Training)|(?=\.)|(?=\?)|(?=!))

711

Appendix A

Chapter 11

1.The pattern pe[ae]k will do what is required. Character classes are supported in Word.

2.Assuming that you use the pattern ([0-9]{2})[./-]([0-9]{2})[./-]([0-9]{4}) to match dates, only a small change is needed to replace U.K. dates.

U.K. dates are in DD/MM/YY format. So to output international date format, you can use the pattern:

\3-\2-\1

to order the components of the original date in the desired order.

Chapter 12

The pattern [A-V] is the most convenient solution.

The pattern [a-ht-z] is a convenient solution.

Figure A-2 shows the solution applied to the test file ClassTest.txt.

Figure A-2

712

Exercise Answers

Chapter 13

1.The following command would find lines that contain the desired matches:

findstr “[A-Z][A-KO-Z][A-Z][0-9][0-9][0-9]” filename*.extension

2.Either of the following will find the desired matches:

findstr /i “^the” filename*.extension

or:

findstr /i /b “the” filename*.extension

Chapter 14

1.The reason that regexp is never matched is that the option regexp occurs earlier among the options in the following and in the other alternate regular expressions:

(regular expression|regex|regexp)

If the character sequence regexp occurs, the option regex always matches; therefore, the pattern regexp is never matched.

The problem is solved very simply, by moving the pattern regexp before the pattern regexp in the options, as follows:

(regular expression|regexp|regex)

Similar simple reordering solves the problem in the second alternate pattern, giving you the following:

reg(ular expression|exp|ex)

However, the following third option cannot be similarly changed:

reg(ular expression|(ex)p?)

It could be changed to the option shown earlier:

reg(ular expression|exp|ex)

If you provide options that are very similar, ensure that the more difficult-to-match options precede similar but simpler options if you want to avoid this class of problem.

2.The following pattern would match:

\$\d{2}\.\d{2}

Remember that the $ metacharacter, if not escaped, matches the end-of-line position. So to match the dollar sign, the escaped \$ is used. Remember, too, that to match the decimal point the \. metacharacter matches only the period character.

If you used a pattern such as the following, using the period metacharacter (unescaped) would match, for example, $12345, which is not the desired result:

\$\d{2}.\d{2}

713

Appendix A

Chapter 15

1.Assuming that the filter has already been created, select the Custom menu option from the firstname drop-down list (cell C3). Select the Begins With option from the top-left drop-down list in the Custom AutoFilter dialog box. Enter the pattern Kar* in the top-right text box, and click the OK button. Only the rows for Karen Peters and Kara Stelman should be displayed.

2.Use Ctrl+F to open the Find and Replace dialog box. Enter the pattern Ju? in the Find What text box. Check the Match Entire Cell Contents check box. Click the Find All button.

Chapter 16

1.There is only one surname, Gringlesby, in the authors table that is not desired as a match. There are several options for a solution, given the existing data. One option is:

USE pubs

SELECT au_lname, au_fname FROM dbo.authors

WHERE au_lname LIKE ‘Gre%’

ORDER BY au_lname

The pattern Gre% will match Green and Greene and not match Gringlesby.

Other options include the following:

WHERE au_lname LIKE ‘Gree%’

and:

WHERE au_lname LIKE ‘Green%’

2.The following Transact-SQL code will achieve the specified results:

USE pubs

SELECT title_id, title, pubdate FROM dbo.titles

WHERE title LIKE ‘%data%’

ORDER BY title

The pattern %data% will match a value for the title column that has zero or more characters preceding the character sequence data and has zero or more characters following the character sequence. In other words, any title containing the character sequence data is matched.

Chapter 17

1.The following lines of code are functionally equivalent, and each is a suitable answer to the question:

SELECT “1950-01-01” LIKE “195%”;

or:

SELECT ‘1950-01-01’ LIKE ‘195%’;

714

Exercise Answers

2.The following code would do what is required:

USE BRegExp;

SELECT ID, LastName, FirstName, Skills FROM Employees

WHERE Skills REGEXP ‘\.NET’

;

Be careful to escape the period character as \.; otherwise, (depending on the data), you might get undesired matches on words such as Ethernet, because the period metacharacter (if it wasn’t escaped) would match the r or Ethernet.

Chapter 18

1.The following code will meet the requirements of the question:

SELECT dBeachPurchases.ItemTitle, dBeachPurchases.ItemAuthor

FROM dBeachPurchases

WHERE dBeachPurchases.ItemAuthor LIKE ‘*B?rns*’;

The query has been created for you in the B?rns in ItemAuthor query in the

AuctionPurchases.mdb database.

Alternatively, the ? metacharacter could be replaced by the _ metacharacter, and/or the * metacharacter could be replaced by the % metacharacter, giving the following code if both replacements are made:

SELECT dBeachPurchases.ItemTitle, dBeachPurchases.ItemAuthor

FROM dBeachPurchases

WHERE dBeachPurchases.ItemAuthor LIKE ‘%B_rns%’;

2.When attempting to answer this question, you will probably notice that Access lacks a quantifier, which signifies that the preceding character is optional. In many regular expression implementations, the pattern Ma?cDonald would be what you need to match McDonald or MacDonald.

Because Access doesn’t have such a metacharacter, which is an optional quantifier, you need to take a different approach.

You need code that will match McDonald anywhere in the ItemAuthor field and code that will match MacDonald anywhere in the ItemAuthor field. The following code in a WHERE clause does the first:

WHERE dBeachPurchases.ItemAuthor LIKE “*McDonald*”

The next piece of code in a WHERE clause does the second:

WHERE dBeachPurchases.ItemAuthor LIKE “*McDonald*”

Because you want to match both possibilities, you can use the OR keyword to produce a compound WHERE clause like this:

WHERE dBeachPurchases.ItemAuthor LIKE “*McDonald*” OR dBeachPurchases.ItemAuthor

LIKE “*MacDonald*”

715

Appendix A

If the WHERE clause is the final line of the SQL code, you will need to add a terminating semicolon to the line:

a.From the Database Objects window, select Queries, and click the New button.

b.Select Design View from the New Query dialog window. Figure A-3 shows the screen’s appearance at that point.

Figure A-3

c.Click the Add button to make the dBeachPurchases table available to query. In the left column of the grid, select ItemTitle in the leftmost Field drop-down list.

d.In the next column, select ItemAuthor in the Field drop-down list. In the same column, add the following to the Criteria cell:

LIKE “*McDonald*” OR LIKE “*MacDonald”

e.Use Ctrl+S to save the new query; name it Mac or McDonald in ItemAuthor. Close the window in which you created the query.

If you open the window later in Design view, it will look like Figure 18-21. Notice that LIKE has been replaced, by Access, with Like. Because SQL is case insensitive, this change made by Access is purely cosmetic but can be a little confusing if you consistently use uppercase for SQL keywords.

Similarly, depending on what you hand-code or produce in Design view, you may find that Access adds paired parentheses, which are not strictly necessary, as in the following code, which is what I saw in the SQL view:

SELECT dBeachPurchases.ItemTitle, dBeachPurchases.ItemAuthor

FROM dBeachPurchases

WHERE (((dBeachPurchases.ItemAuthor) Like “*MacDonald*”)) OR

(((dBeachPurchases.ItemAuthor) Like “*McDonald*”));

716

Exercise Answers

Chapter 19

1.The only part of FinalT.html that you need to change is the line where the myRegExp variable is declared:

var myRegExp = /^\d+$/;

The pattern ^\d+$ matches the beginning-of-string position (using the ^ metacharacter) and then matches one or more (using the + metacharacter) numeric digits (signified by the \d metacharacter), followed by the end-of-string position (signified by the $ metacharacter).

The complete file, NumericDigitsOnly.html, is included in the code download.

2.Apart from modifying minor cosmetic details, for example, in the title element of the Web page, you will need to alter only the assignment statement for the myRegExp global variable.

The problem definition can be stated as follows:

Match a sequence of four numeric digits followed by a whitespace character, followed by a sequence of four numeric digits, followed by a whitespace character, followed by a sequence of four numeric digits, followed by a whitespace character, followed by a sequence of four numeric digits.

So the pattern is as follows:

\d{4}\s\d{4}\s\d{4}\s\d{4}

And the declaration of the myRegExp variable together with its documentation would look like this:

var myRegExp = /\d{4}\s\d{4}\s\d{4}\s\d{4}/;