Beginning Regular Expressions 2005.pdf

Metacharacters and Modifiers

of whitespace such as space characters, tab characters, or any mix of those characters before the right-angled bracket, >. So you cannot assume that there is no whitespace there, and you should allow for the possibility that whitespace exists.

The rules of XML also allow the use of newline characters inside the start tag of an element. In the file Person2.xml, two newline characters are used to lay out the start tag of the Person element in a potentially more readable way. As far as an XML parser is concerned, this is the same document as Person1.xml, because the logical structure is the same. But as far as a regular expressions engine is concerned, it is different because it contains a different sequence of characters in the start tag of the Person element.

<?xml version='1.0'?>
<Person
DateOfBirth="1970/01/12"
>
<FirstName>John</FirstName>
<LastName>Scoliosis</LastName>
</Person>

If you are to be able to use regular expressions to match sequences of characters inside XML documents, you need to allow for such use of whitespace inside start tags and elsewhere in XML documents.

Similar considerations regarding whitespace apply to HTML and XHTML documents.
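
As a minimal sketch of the point above, the following JavaScript compares a start tag written on one line with the same start tag laid out across several lines. The variable names and both test strings are illustrative assumptions, not the contents of the book's sample files.

```javascript
// Hypothetical test strings: the same start tag laid out two ways.
const singleLine = '<Person DateOfBirth="1970/01/12">';
const multiLine = '<Person\nDateOfBirth="1970/01/12"\n>';

// A pattern that assumes one space after the element name and
// no whitespace before the closing >.
const rigid = /<Person DateOfBirth=".*">/;

// A pattern that allows whitespace in both of those positions.
const tolerant = /<Person\s+DateOfBirth=".*"\s*>/;

console.log(rigid.test(singleLine));    // true
console.log(rigid.test(multiLine));     // false
console.log(tolerant.test(singleLine)); // true
console.log(tolerant.test(multiLine));  // true
```

An XML parser would treat both strings as the same start tag, but only the second pattern matches both sequences of characters.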

Many regular expression implementations provide one or more metacharacters that can match some or all of the likely whitespace characters. First, let’s look at the \s metacharacter.

The \s Metacharacter

The \s metacharacter is the least specific of the whitespace metacharacters: it matches any single whitespace character, whether that is a space character, a tab character, or a newline character.
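
The behavior can be sketched in JavaScript, assuming JavaScript's regular expression engine (other implementations may differ in exactly which characters they treat as whitespace). The names whitespace and samples are illustrative.

```javascript
// Test \s against single characters written with escape sequences.
const whitespace = /\s/;

const samples = {
  space: ' ',
  tab: '\t',
  newline: '\n',
  letter: 'A',
};

for (const [name, ch] of Object.entries(samples)) {
  console.log(name + ': ' + whitespace.test(ch));
}
// space: true
// tab: true
// newline: true
// letter: false
```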

Try It Out

The \s Metacharacter

1. Open the Komodo Regular Expression Toolkit, and clear any residual regular expression and test text.

2. In the Enter a String to Match Against text area, type ABC, press the Return key, and then type DEF.

3. In the Enter a Regular Expression area, type the pattern \s.

4. Inspect the result in the Enter a String to Match Against area and the gray area below it. Notice that the invisible character (which is a newline character) immediately after the C of ABC is highlighted in pale green on-screen.

Figure 4-14 shows the expected appearance. At this point, the \s metacharacter matches a newline character.

5. Delete the regular expression and the test text. (The reason for doing so is that sometimes in Komodo version 2.5, the highlighting is misplaced after editing.)

6. In the Enter a String to Match Against area, type the string ABC DEF (that is, ABC, then a space character, then DEF).


Chapter 4

Figure 4-14

7. In the Enter a Regular Expression area, type the pattern \s, and inspect the results.

Figure 4-15 shows the expected appearance of the result. At this point, the \s metacharacter matches a space character (shown in Komodo Regular Expression Toolkit as a mid dot).

Figure 4-15

It isn’t possible to type a tab character in the Enter a String to Match Against area, but it can be pasted into the area.

8. Delete the regular expression and test text.

9. Open Notepad, and type ABC, followed by a tab character, followed by DEF.


10. Use the Ctrl+A keyboard shortcut to select all the text in Notepad, and use the Ctrl+C keyboard shortcut to copy the selected text.

11. In the Enter a String to Match Against area of the Komodo Regular Expression Toolkit, use the Ctrl+V keyboard shortcut to paste the copied text (including the tab character). In the Komodo Regular Expression Toolkit, the tab character is shown as a right-pointing arrow (which can be seen in Figure 4-16).

12. In the Enter a Regular Expression area, type the regular expression pattern \s, and inspect the results.

Figure 4-16 shows the expected results. Note that the right-pointing arrow (which represents a tab character) between ABC and DEF is highlighted as a match.

Figure 4-16

How It Works

The \s metacharacter matches any kind of whitespace character.

With the test text specified in Step 2 (which includes a newline character) matching fails until the regular expression engine reaches the position after the C of ABC. At that position, the character that follows is a newline character. That matches the \s metacharacter. The matching character is indicated by pale green highlighting on-screen immediately after the C of ABC.

With the test text specified in Step 6 (which includes a space character), matching fails until the regular expression engine reaches the position after the C of ABC. At that position, the character that follows is a space character. A space character is a match for the \s metacharacter. The matching space character is indicated by pale green highlighting on-screen after the C of ABC.
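
The position of the match can also be checked in code. The following is a sketch in JavaScript rather than the Komodo toolkit; the helper name firstWhitespace is hypothetical. It reports the first character matched by \s and its zero-based index for the three test texts used in the steps above.

```javascript
// Return the first whitespace match and its position, or null if none.
function firstWhitespace(text) {
  const match = /\s/.exec(text);
  return match === null ? null : { char: match[0], index: match.index };
}

// Newline, space, and tab versions of the test text.
const testTexts = ['ABC\nDEF', 'ABC DEF', 'ABC\tDEF'];

for (const text of testTexts) {
  const found = firstWhitespace(text);
  // JSON.stringify makes the invisible character readable.
  console.log(JSON.stringify(found.char) + ' at index ' + found.index);
}
// "\n" at index 3
// " " at index 3
// "\t" at index 3
```

Index 3 is the position immediately after the C of ABC, which is where the highlighting appears on-screen.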


Handling Optional Whitespace

Matching optional whitespace is a task that is commonly required when dealing with HTML, XHTML, and XML documents.

For the purposes of the following example, assume that only paired double quotation marks are used in the test text to delimit the value of the DateOfBirth attribute (XML syntax also allows paired apostrophes).

Try It Out

Matching Optional Whitespace

1. Find the file CheckWhitespace.html in Windows Explorer, and double-click the file to open it in the default browser.

2. Click the Click Here to Enter Text button, and in the alert window that opens, type the test text <Person DateOfBirth="AnythingGoesHere" >. Be sure not to have any space characters on either side of the = character.

3. Click the OK button, and inspect the alert window that is displayed. Figure 4-17 shows the expected result after Step 3 using the Firefox browser.

Figure 4-17

How It Works

The test file CheckWhitespace.html is shown here:

<html>
<head>
<title>Check start tag for optional whitespace</title>
<script language="javascript" type="text/javascript">
var myRegExp = /<Person DateOfBirth=".*"\s*>/;

function Validate(entry){
  return myRegExp.test(entry);
} // end function Validate()

function ShowPrompt(){
  var entry = prompt("This script tests for matches for the regular expression pattern:\n " + myRegExp + ".\nType in a string and click on the OK button.", "Type your text here.");
  if (Validate(entry)){
    alert("There is a match!\nThe regular expression pattern is: " + myRegExp + ".\n The string that you entered was: '" + entry + "'.");
  } // end if
  else{
    alert("There is no match in the string you entered.\n" + "The regular expression pattern is " + myRegExp + "\n" + "You entered the string: '" + entry + "'." );
  } // end else
} // end function ShowPrompt()
</script>
</head>
<body>
<form name="myForm">
<br />
<button type="Button" onclick="ShowPrompt()">Click here to enter text.</button>
</form>
</body>
</html>

Notice the line where the variable myRegExp is declared:

var myRegExp = /<Person DateOfBirth=".*"\s*>/;

Remember that forward slashes are used in JavaScript to delimit a regular expression. So the regular expression pattern to be matched is as follows:

<Person DateOfBirth=".*"\s*>

Most of the characters in the pattern are literal characters. Notice that the value of the DateOfBirth attribute is to match the pattern .*; in other words, it will match zero or more characters (any character other than a line terminator such as a newline character). After the second of the paired double quotation marks around the value of the DateOfBirth attribute, the pattern is \s* (the \s metacharacter followed by the asterisk quantifier), which matches zero or more whitespace characters.

Test the regular expression by entering test text several times, using different numbers of spaces before the > character. You won’t be able to type a tab character or newline character into the dialog box to test that the \s metacharacter matches them: pressing the Tab key shifts the focus away from the line where you enter text, and pressing the Return key is equivalent to clicking the OK button. However, you can paste text containing tab or newline characters into the dialog box.
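
Another way to exercise the tab and newline cases is to call the pattern from a script, where escape sequences stand in for the characters that are awkward to type. The following is a sketch under that assumption; the candidates array and its test strings are illustrative and are not part of CheckWhitespace.html.

```javascript
// Same pattern as in CheckWhitespace.html.
const myRegExp = /<Person DateOfBirth=".*"\s*>/;

// Hypothetical test strings covering different whitespace before >.
const candidates = [
  '<Person DateOfBirth="AnythingGoesHere">',    // no whitespace
  '<Person DateOfBirth="AnythingGoesHere"   >', // three spaces
  '<Person DateOfBirth="AnythingGoesHere"\t>',  // tab character
  '<Person DateOfBirth="AnythingGoesHere"\n>',  // newline character
  '<Person DateOfBirth="AnythingGoesHere" x>',  // stray non-whitespace
];

const results = candidates.map((text) => myRegExp.test(text));
console.log(results.join(', '));
// true, true, true, true, false
```

The last string fails because \s* can consume the space but not the x, so the pattern never reaches the > character.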
