Beginning Regular Expressions 2005.pdf

Chapter 24

Derivation by Restriction

W3C XML Schema often offers several ways to specify a desired structure. Of the methods of derivation in the preceding list, derivation by restriction is the most commonly used.

One method of restriction is to specify an enumeration. The following XML instance document, BookEnum.xml, is associated with a W3C XML Schema document that contains an enumeration:

<?xml version="1.0" encoding="UTF-8"?>
<Book xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="C:\BRegExp\Ch24\BookEnum.xsd">
  <Chapter number="1">Some content</Chapter>
  <Chapter number="2">Some content</Chapter>
  <Chapter number="3">Some content</Chapter>
  <Chapter number="4">Some content</Chapter>
  <Chapter number="5">Some content</Chapter>
</Book>

The associated W3C XML Schema document, BookEnum.xsd, created by XMLSpy, constrains the values of the number attribute of the Chapter element to be an enumeration of values from 1 through 5:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="Book">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Chapter" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="Chapter">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute name="number" use="required">
            <xs:simpleType>
              <xs:restriction base="xs:NMTOKEN">
                <xs:enumeration value="1"/>
                <xs:enumeration value="2"/>
                <xs:enumeration value="3"/>
                <xs:enumeration value="4"/>
                <xs:enumeration value="5"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
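
To make the effect of the enumeration concrete, the following Python sketch emulates the check by hand: it parses a small Book document and tests each number attribute against the five enumerated values. This is only an illustration, not a schema validator. The names ALLOWED, invalid_chapter_numbers, GOOD, and BAD are hypothetical, and a real W3C XML Schema processor performs many additional checks.

```python
import xml.etree.ElementTree as ET

# The five values listed by the xs:enumeration facets in BookEnum.xsd.
ALLOWED = {"1", "2", "3", "4", "5"}

def invalid_chapter_numbers(xml_text):
    """Return the number attribute values that fall outside the enumeration.

    A Chapter element with no number attribute is reported as None,
    because the schema declares the attribute as required.
    """
    root = ET.fromstring(xml_text)
    return [
        chapter.get("number")
        for chapter in root.iter("Chapter")
        if chapter.get("number") not in ALLOWED
    ]

# Hypothetical sample documents (schema location attribute omitted).
GOOD = (
    '<Book><Chapter number="1">Some content</Chapter>'
    '<Chapter number="5">Some content</Chapter></Book>'
)
BAD = (
    '<Book><Chapter number="3">Some content</Chapter>'
    '<Chapter number="6">Some content</Chapter></Book>'
)

print(invalid_chapter_numbers(GOOD))  # []
print(invalid_chapter_numbers(BAD))   # ['6']
```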


Regular Expressions in W3C XML Schema

The value of the number attribute is a simple type value. The schema document that XMLSpy creates uses the xs:NMTOKEN datatype, because the sample values of 1, 2, 3, 4, and 5 in the XML instance document allow for that datatype. However, the same constraint on values could be applied using the xs:pattern element as in BookPattern.xsd, shown here:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="Book">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Chapter" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="Chapter">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute name="number" use="required">
            <xs:simpleType>
              <xs:restriction base="xs:NMTOKEN">
                <xs:pattern value="(1|2|3|4|5)"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>

An XML instance document associated with BookPattern.xsd is provided as BookPattern.xml in the code download. The only change from BookEnum.xml is that the xsi:noNamespaceSchemaLocation attribute points to the BookPattern.xsd file:

<Book xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="C:\BRegExp\Ch24\BookPattern.xsd">

The xs:pattern element is featured prominently in the remainder of this chapter, because it is the W3C XML Schema element that uses regular expressions. The value of the xs:pattern element's value attribute is a regular expression pattern, hence the name of the element.

In the pattern shown in the preceding code listing, notice that the value of the value attribute is a fairly simple example of alternation, (1|2|3|4|5), which allows the attribute value to be any one of 1, 2, 3, 4, or 5.
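
An xs:pattern facet has to match the whole attribute value, not just part of it. The following Python sketch approximates that behavior with re.fullmatch. Python's re module is a different regular expression dialect from the one used in W3C XML Schema, so this is only an illustration, and the helper name matches_pattern is hypothetical. The sketch also compares the alternation with the character class [1-5], which accepts the same five values.

```python
import re

def matches_pattern(pattern, value):
    """Approximate an xs:pattern check: the whole value must match."""
    return re.fullmatch(pattern, value) is not None

ALTERNATION = r"(1|2|3|4|5)"  # the pattern used in BookPattern.xsd
CHAR_CLASS = r"[1-5]"         # a more compact pattern for the same values

for value in ["1", "5", "6", "12", ""]:
    print(
        repr(value),
        matches_pattern(ALTERNATION, value),
        matches_pattern(CHAR_CLASS, value),
    )
```

Only "1" and "5" are accepted by either pattern here; "12" is rejected because the whole value, not a substring, has to match.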

Before looking at the range of metacharacters supported in W3C XML Schema and how those metacharacters can be used, read about how Unicode is relevant to regular expressions in W3C XML Schema documents.


Unicode and W3C XML Schema

XML documents consist of sequences of Unicode characters. Unicode contains many thousands of characters. In reality, few, if any, applications can display all Unicode characters, and very few human beings could easily understand all Unicode characters. To make Unicode more manageable, the characters are divided into Unicode character classes and Unicode blocks. Each of these is discussed later in this section.

Full information about Unicode is located at www.unicode.org. At the time of this writing, the current version of the Unicode Standard is version 4.0.1. Further information about the Unicode Standard is located at www.unicode.org/standard/standard.html.

Unicode Overview

The Unicode Standard defines the universal character set. The aim of Unicode is to allow the interchange of text content across all the languages of planet Earth. Unicode specifies a text encoding for most characters of most languages, as well as characters to assist in interoperability with older character encodings.

The Windows Character Map utility provides a convenient way to examine the Unicode code points of many individual characters. Figure 24-6 shows the uppercase A selected. Notice in the lower part of the figure that uppercase A is U+0041. The number following the U and the + sign is written in hexadecimal and must consist of at least four hexadecimal digits. In this example, uppercase A is hexadecimal 0041, which is 65 in decimal notation.

Figure 24-6


In XML, uppercase A can also be written as a numeric character reference such as &#x0041;. In most situations, it is simpler to express characters commonly used in English literally.
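
The relationship between the code point U+0041, decimal 65, and the character A can be checked with a few lines of Python. This sketch also parses numeric character references with the standard xml.etree.ElementTree module. The element name c and the variable names are arbitrary choices made for the illustration.

```python
import xml.etree.ElementTree as ET

code_point = int("0041", 16)   # interpret 0041 as a hexadecimal number
print(code_point)              # 65
print(chr(code_point))         # A
print(f"U+{ord('A'):04X}")     # U+0041

# Hexadecimal and decimal character references resolve to the same character.
hex_ref = ET.fromstring("<c>&#x0041;</c>").text
dec_ref = ET.fromstring("<c>&#65;</c>").text
print(hex_ref, dec_ref)        # A A
```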

A Unicode character class indicates the type of usage for a set of characters, such as lowercase letters. A Unicode character block indicates a language or other means of expression associated with that block of characters.

Using Unicode Character Classes

When using a Unicode character class in W3C XML Schema documents, the character class is specified as follows:

\p{characterClass}
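
Python's built-in re module does not support the \p{characterClass} notation, but the standard unicodedata module reports the same one- and two-letter general categories, which makes it a convenient way to explore them outside a schema. In the sketch below, in_class is a hypothetical helper, not part of any library. It treats a one-letter class such as L as the union of every category whose code begins with that letter.

```python
import unicodedata

def in_class(char, character_class):
    """Return True if char's general category falls within character_class."""
    return unicodedata.category(char).startswith(character_class)

print(unicodedata.category("A"))  # Lu
print(unicodedata.category("a"))  # Ll
print(unicodedata.category("7"))  # Nd
print(unicodedata.category("_"))  # Pc
print(in_class("A", "L"))         # True
print(in_class("A", "Ll"))        # False
```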

The following table summarizes the Unicode character classes supported in W3C XML Schema.

Unicode Character Class    Description
-----------------------    -------------------------
C                          Other characters
Cc                         Control characters
Cf                         Format characters
Cn                         Unassigned code points
L                          Letters
Ll                         Lowercase letters
Lm                         Modifier letters
Lo                         Other letters
Lt                         Title-case letters
Lu                         Uppercase letters
M                          All marks
Mc                         Spacing combining marks
Me                         Enclosing marks
Mn                         Nonspacing marks
N                          Numbers
Nd                         Decimal digits
Nl                         Letter numbers
No                         Other numbers
P                          Punctuation
Pc                         Connector punctuation

Table continued on following page
