
Chapter 24

9. Attempt to validate WordUnicode3.xml against WordUnicode3.xsd (see Figure 24-9). On this occasion, there is no match.

Figure 24-9

How It Works

The files WordUnicode.xml and WordUnicode.xsd attempt to validate the German word Führer (leader) against the pattern \w+. Validation succeeds, which shows that in W3C XML Schema the \w metacharacter matches some letters that aren’t used in English.

The files WordUnicode2.xml and WordUnicode2.xsd attempt to validate the German word Führer (leader) against the pattern \p{L}. Because the word Führer contains only Unicode letters, there is a match.

The files WordUnicode3.xml and WordUnicode3.xsd attempt to validate the German word Führer against the pattern \p{L} while also specifying the use of the Unicode character block BasicLatin, indicated by the pattern \p{IsBasicLatin}. Because the word Führer contains the letter ü, which is not in the range U+0000 through U+007F (it is U+00FC), there is no match, and validation fails.
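The three outcomes above can be approximated outside a schema validator. The following Python sketch is only an illustration under stated assumptions: Python's re module does not support \p{L} or \p{IsBasicLatin}, so the helper functions is_letter and is_basic_latin (hypothetical names, not part of the chapter's files) emulate them with the unicodedata module, and re.fullmatch stands in for the whole-value matching that W3C XML Schema performs. The two definitions of \w are not identical in every detail, although both accept the letter ü.

```python
import re
import unicodedata

# The word is written with an explicit escape so that ü is the single
# precomposed code point U+00FC.
WORD = "F\u00fchrer"

def is_letter(ch: str) -> bool:
    # Emulates \p{L}: any character whose Unicode general category starts with "L".
    return unicodedata.category(ch).startswith("L")

def is_basic_latin(ch: str) -> bool:
    # Emulates \p{IsBasicLatin}: the block U+0000 through U+007F.
    return ord(ch) <= 0x7F

# Case 1: \w+ against the whole word. Python's \w is Unicode-aware for str
# patterns, so it accepts ü.
matches_word_chars = re.fullmatch(r"\w+", WORD) is not None

# Case 2: every character must be a Unicode letter.
all_letters = all(is_letter(ch) for ch in WORD)

# Case 3: every character must be a letter and also lie in the BasicLatin block.
all_basic_latin_letters = all(is_letter(ch) and is_basic_latin(ch) for ch in WORD)

print(matches_word_chars, all_letters, all_basic_latin_letters)
# prints: True True False
```

The third check fails for the same reason that validation against WordUnicode3.xsd fails: ü is a letter, but its code point lies outside U+0000 through U+007F.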

Metacharacters Supported in W3C XML Schema

The metacharacters supported in W3C XML Schema include a few that relate directly to XML and are not implemented in most other regular expression implementations.

The following table summarizes the metacharacters supported in W3C XML Schema version 1.0. See also information in the preceding section about Unicode support in W3C XML Schema.


Regular Expressions in W3C XML Schema

 

 

 

 

Metacharacter          Description

^                      Not supported outside negated character classes (see the discussion of positional metacharacters).
$                      Not supported (see the discussion of positional metacharacters).
\d                     Matches a numeric digit.
\D                     Matches a character that is not a numeric digit.
\s                     Matches a whitespace character.
\S                     Matches a character that is not a whitespace character.
\w                     Matches a “word” character.
\W                     Matches a character that is not a “word” character.
| (pipe character)     Alternation. Allows a choice among two or more options of the preceding and following groups or characters.
?                      Quantifier. Specifies that there is zero or one occurrence of the preceding character or group.
*                      Quantifier. Specifies that there are zero or more occurrences of the preceding character or group.
+                      Quantifier. Specifies that there are one or more occurrences of the preceding character or group.
{n,m}                  Quantifier. Specifies that there is a minimum of n occurrences and a maximum of m occurrences of the preceding character or group.
. (period character)   Matches any character except the newline and carriage return characters.
[...]                  Positive character class. One character contained between the square brackets is matched once.
[^...]                 Negated character class. One character not contained between the square brackets is matched once.
\i                     Matches a character allowed as the first character in an XML name: a letter, an underscore, or a colon. Restricted to ASCII, this corresponds to the character class [A-Za-z_:].
\I                     Matches a character not allowed as the first character in an XML name. Restricted to ASCII, this corresponds to the character class [^A-Za-z_:].
\c                     Matches an XML 1.0 name character. Includes the characters in the class [A-Za-z0-9.:_-].
\C                     Matches a character that is not an XML 1.0 name character.

Positional Metacharacters

In W3C XML Schema, ^ and $ are not supported as positional metacharacters, that is, as markers for the beginning or end of a line or string. The reason is a difference in how matching takes place in W3C XML Schema compared to many other regular expression implementations.


In many regular expression implementations, the pattern [A-Z][0-9] will match any string containing an uppercase alphabetic character followed by a numeric digit. However, in W3C XML Schema, there is a match only if the whole string is matched by the pattern. In other words, when matching in W3C XML Schema, the pattern [A-Z][0-9] is interpreted as though it were ^[A-Z][0-9]$.

Because all W3C XML Schema regular expression patterns are interpreted as though both the ^ and $ metacharacters were already present, they are not supported separately from that implicit mechanism.

The ^ metacharacter can, however, be used in a negated character class.
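The difference between the two matching styles can be sketched in Python. This is an approximation rather than a schema validator: re.search plays the role of an implementation that looks for a match anywhere in the string, and re.fullmatch approximates the implicit whole-value matching of xs:pattern. The function names search_style and schema_style are hypothetical and chosen only for this illustration.

```python
import re

PATTERN = r"[A-Z][0-9]"

def search_style(value: str) -> bool:
    # Unanchored matching: a match anywhere inside the value is enough.
    return re.search(PATTERN, value) is not None

def schema_style(value: str) -> bool:
    # Whole-value matching: the pattern behaves as if it were ^[A-Z][0-9]$.
    return re.fullmatch(PATTERN, value) is not None

for value in ["A1", "xxA1yy", "A12"]:
    print(value, search_style(value), schema_style(value))
# prints:
# A1 True True
# xxA1yy True False
# A12 True False
```

Only the value A1 is accepted under whole-value matching, because the other two values contain characters that the pattern does not account for.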

Matching Numeric Digits

The \d metacharacter can be used to match a numeric digit. For example, the sample document Document.xml contains a number attribute that must be a single numeric digit:

<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="C:\BRegExp\Ch24\Document.xsd">
  <Section number="1">Content</Section>
  <Section number="2">Content</Section>
  <Section number="3">Content</Section>
</Document>

The corresponding W3C XML Schema document, Document.xsd, uses the \d metacharacter in an xs:pattern element to specify that the value of the Section element’s number attribute is a single numeric digit:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="Document">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Section" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="Section">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute name="number" use="required">
            <xs:simpleType>
              <xs:restriction base="xs:NMTOKEN">
                <!-- Exactly one numeric digit; the pattern is implicitly anchored. -->
                <xs:pattern value="\d" />
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>


The value of the xs:restriction element’s base attribute is shown as the type xs:NMTOKEN, but other types could be used in this situation, such as xs:byte.

Alternation

Alternation is supported in W3C XML Schema. The example using BookPattern.xml and BookPattern.xsd earlier in this chapter shows how alternation can be used with the xs:pattern element.

Using the \w and \s Metacharacters

The \w metacharacter represents word characters, including uppercase and lowercase A through Z. The \s metacharacter represents a whitespace character.

The pattern \w+\s+\w+ can be used to represent a name written as a first name, followed by one or more whitespace characters, followed by a last name. A sample document, Name.xml, is shown here:

<?xml version="1.0" encoding="UTF-8"?>
<Names xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:noNamespaceSchemaLocation="C:\BRegExp\Ch24\Name.xsd">
  <Name>John Smith</Name>
  <Name>Alicia Manton</Name>
  <Name>Pierre Laval</Name>
</Names>

A corresponding schema, Name.xsd, uses the pattern \w+\s+\w+ to specify how the value of the Name element is to be constructed:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="Names">
    <xs:complexType>
      <xs:sequence>
        <!-- Declared locally with name= rather than ref=, because an element
             reference cannot carry its own xs:simpleType definition. -->
        <xs:element name="Name" maxOccurs="unbounded">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:pattern value="\w+\s+\w+" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

The pattern matches a sequence of word characters followed by one or more whitespace characters, followed by a sequence of word characters.

The pattern specified wouldn’t match names such as Maria Von Trapp because \w+\s+\w+ means, in effect, ^\w+\s+\w+$.
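The effect on three-part names can be checked with a short Python sketch, again using re.fullmatch as a stand-in for the implicit anchoring of xs:pattern. The second pattern, \w+(\s+\w+)+, is a hypothetical alternative that is not taken from the chapter's files; it is included only to show one way the restriction might be relaxed to allow additional name parts.

```python
import re

TWO_PART = r"\w+\s+\w+"        # the pattern used in Name.xsd
MULTI_PART = r"\w+(\s+\w+)+"   # hypothetical alternative: two or more parts

def is_valid(pattern: str, value: str) -> bool:
    # Whole-value matching, approximating how xs:pattern is applied.
    return re.fullmatch(pattern, value) is not None

names = ["John Smith", "Alicia Manton", "Pierre Laval", "Maria Von Trapp"]

# Map each name to (accepted by TWO_PART, accepted by MULTI_PART).
results = {name: (is_valid(TWO_PART, name), is_valid(MULTI_PART, name))
           for name in names}

for name, (two_part_ok, multi_part_ok) in results.items():
    print(name, two_part_ok, multi_part_ok)
# prints:
# John Smith True True
# Alicia Manton True True
# Pierre Laval True True
# Maria Von Trapp False True
```

The name Maria Von Trapp is rejected by the original pattern because, after Maria Von has been matched, the remaining text Trapp (with its leading space) is left over, and whole-value matching does not allow leftover characters.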
