
Beginning Perl Web Development - From Novice To Professional (2006)


CHAPTER 9: XML PARSING WITH PERL

XML::Simple Options

Some XML::Simple options are more important than others. This section examines two frequently used XML::Simple options: forcearray and KeyAttr. For a full listing and explanation of all of the XML::Simple options, see perldoc XML::Simple.

Forcearray

You already saw the forcearray option in an example, but that example didn’t really need to have forcearray enabled. Consider this XML:

<?xml version="1.0"?>
<customer-data>
  <customer>
    <first_name>Frank</first_name>
    <last_name>Sanbeans</last_name>
    <dob>3/10</dob>
    <email>frank@example.com</email>
    <vehicle>Volvo S60</vehicle>
    <vehicle>Honda Accord</vehicle>
  </customer>
  <customer>
    <first_name>Sandy</first_name>
    <last_name>Sanbeans</last_name>
    <dob>4/15</dob>
    <email>sandy@example.com</email>
    <vehicle>McLaren MP4-20</vehicle>
    <vehicle>Chevrolet S-10</vehicle>
  </customer>
</customer-data>

The following is code to parse this XML. Notice its similarities to the first example shown in the chapter.

#!/usr/bin/perl

use strict;
use XML::Simple;

my $xml = XMLin('./xml_example2', forcearray => 1);

foreach my $customer (@{$xml->{customer}}) {
    print "Name: $customer->{first_name}->[0] ";
    print "$customer->{last_name}->[0]\n";
    print "Birthday: $customer->{dob}->[0]\n";
    print "E-mail Address: $customer->{email}->[0]\n";
    print "Vehicle(s): @{$customer->{vehicle}}\n";
}


The code reads in the XML with the forcearray option enabled. Each single-valued element is then accessed through index 0 of its array reference. The exception is the <vehicle> element, which may be multivalued; it is printed by dereferencing the entire array for that element.

Since four of the five elements in this XML data structure can have only one value, you may want to specify which elements should go into an array, instead of rolling every element into an array. Instead of merely accepting 1 for enabled and 0 for disabled, the forcearray option can accept a reference to an array of element names that should be rolled into an array. Continuing with the previous example, using forcearray just for the <vehicle> element looks like this:

#!/usr/bin/perl

use strict;
use XML::Simple;

my $xml = XMLin('./xml_example2', forcearray => [ 'vehicle' ]);

foreach my $customer (@{$xml->{customer}}) {
    print "Name: $customer->{first_name} ";
    print "$customer->{last_name}\n";
    print "Birthday: $customer->{dob}\n";
    print "E-mail Address: $customer->{email}\n";
    print "Vehicle(s): @{$customer->{vehicle}}\n";
}

Notice in this example that the array index ->[0] is now gone for the original four elements, since those are no longer rolled into an array by XML::Simple.
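The value of forcearray becomes clearer with a customer who owns only one vehicle. The following is a minimal sketch, assuming XML::Simple is installed; the inline XML string and the variable names are made up for illustration and are not part of the chapter's sample files.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical input: a customer who owns only one vehicle.
my $xml_text = '<customer><first_name>Frank</first_name>'
             . '<vehicle>Volvo S60</vehicle></customer>';

# Without forcearray, a single <vehicle> collapses to a plain string.
my $plain = XMLin($xml_text);

# With forcearray, <vehicle> is always an array reference.
my $forced = XMLin($xml_text, forcearray => [ 'vehicle' ]);

print "Without forcearray: ", (ref($plain->{vehicle}) || 'plain string'), "\n";
print "With forcearray:    ", ref($forced->{vehicle}), "\n";
```

Because the second call always yields an array reference, code such as @{$customer->{vehicle}} keeps working whether a customer has one vehicle or several.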

KeyAttr

The KeyAttr option is used to control how elements are rolled into arrays and hashes. Recall an earlier example showing an XML structure as displayed with Data::Dumper. Changing the key attribute to be the <email> element reveals this output:

$VAR1 = {
    'customer' => {
        'frank@example.com' => {
            'dob' => '3/10',
            'first_name' => 'Frank',
            'last_name' => 'Sanbeans'
        },
        'sandy@example.com' => {
            'dob' => '4/15',
            'first_name' => 'Sandy',
            'last_name' => 'Sanbeans'
        }
    }
};


The code that produced this output is similar to the earlier Data::Dumper example. Notice the use of the KeyAttr option in the call to the XMLin() subroutine.

#!/usr/bin/perl

use strict;
use XML::Simple;
use Data::Dumper;

my $xml = XMLin('./example1.xml',
                forcearray => [ 'vehicle' ],
                KeyAttr    => [ 'email' ]);

print Dumper($xml);
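Once the customers are keyed by e-mail address, a single record can be looked up directly rather than by looping. The sketch below assumes XML::Simple is installed and uses a shortened, hypothetical inline copy of the customer data instead of the example1.xml file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical inline copy of two customer records.
my $xml_text = <<'END_XML';
<customer-data>
  <customer>
    <first_name>Frank</first_name>
    <email>frank@example.com</email>
  </customer>
  <customer>
    <first_name>Sandy</first_name>
    <email>sandy@example.com</email>
  </customer>
</customer-data>
END_XML

my $xml = XMLin($xml_text, KeyAttr => [ 'email' ]);

# Customers are now keyed by e-mail address, so no loop is needed.
my $first_name = $xml->{customer}->{'frank@example.com'}->{first_name};
print "First name: $first_name\n";
```

Note the single quotes around the e-mail address; inside double quotes, Perl would try to interpolate @example as an array.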

Note: Take a look through the perldoc for XML::Simple. The documentation for this module is quite good and actually helps to sort out the options into categories such as important, handy, advanced, and others.

Parsing XML with XML::SAX

XML::SAX is a stream-based parser. When XML is parsed by XML::SAX, each new element encountered signals an event. XML::SAX hands off the processing for that event to the appropriate handler method, and it’s your responsibility to write handlers for the events as they are passed along by XML::SAX. For example, when XML::SAX encounters the beginning of an XML tag, it passes that information along in an event stream to your handler, which implements a number of methods to work with that event. The data from the event is usually placed inside a hash, but that depends on the type of event.

There are parsers already written for XML::SAX. Two such parsers are XML::LibXML::SAX::Parser and XML::SAX::PurePerl. XML::LibXML::SAX::Parser requires the libxml2 library, which is written in C. As you might guess by the name, XML::SAX::PurePerl is written entirely in Perl. These two parsers, along with others for XML::SAX, may be already installed on your system. In practice, you’ll find that you’ll be writing your own handler more often than not.

This program prints a list of the available parsers on your system:

#!/usr/bin/perl

use XML::SAX;
use strict;

my @parsers = @{XML::SAX->parsers()};

foreach my $parser (@parsers) {
    print "--> ", $parser->{ Name }, "\n";
}


On my Debian (Sarge) system, the output looks like this:

--> XML::SAX::PurePerl

--> XML::LibXML::SAX::Parser

--> XML::LibXML::SAX

--> XML::SAX::Expat

A parser is chosen through the XML::SAX::ParserFactory interface. However, in practice, programmers frequently leave it up to XML::SAX to decide which parser to use, with the default being decided by the order in which the parsers were installed. Though this may sound blatantly obvious, parsers implement functions to parse XML. The parser methods include parse_uri(), parse_file(), and so on.

It’s important to realize the difference, from an XML::SAX standpoint, between a parser and a handler. A parser is usually software in the form of a Perl module that is installed with XML::SAX or can be installed from CPAN. A handler, on the other hand, is software that you write as part of the XML parsing programming task. The parser is created or instantiated by the XML::SAX::ParserFactory and is passed an argument telling it which handler will be used. The handler then implements interfaces for events handed to it from the parser. Note that a parser can be passed numerous arguments in addition to the name of the handler to use.
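If you would rather not leave the choice of parser to the defaults, one option is to set the $XML::SAX::ParserPackage variable before asking the factory for a parser. The following is a small sketch, assuming XML::SAX::PurePerl is available on your system (it is distributed with XML::SAX); the use of a bare XML::SAX::Base object as a do-nothing handler is only a placeholder for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::SAX;
use XML::SAX::Base;

# Ask the factory for one specific parser instead of the system default.
$XML::SAX::ParserPackage = 'XML::SAX::PurePerl';

# XML::SAX::Base serves here as a do-nothing placeholder handler.
my $parser = XML::SAX::ParserFactory->parser(
    Handler => XML::SAX::Base->new,
);

print "Parser in use: ", ref($parser), "\n";
```

Pinning the parser this way can make behavior more predictable across machines, at the cost of giving up a faster C-based parser where one is installed.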

XML::SAX Parser Methods

As previously stated, a parser implements several methods for parsing XML. For most of the methods, you pass the XML as an argument, as well as other options for parsing. The parser methods are as follows:

parse([options]): This is a generic method that accepts its options as a list of name=>value pairs or as a hash.

parse_uri(uri [, options]): This is a commonly used method to parse XML as denoted by the URI.

parse_file(filestream [, options]): This method parses a file stream, such as an open filehandle. Note that the argument is expected to be a stream rather than the name of a plain file.

parse_string(string [, options]): This method parses the XML contained in the string passed to it.
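As an illustration of these methods, the sketch below feeds the same XML to a parser twice: once through parse_string() and once through the generic parse() method with a Source option. It assumes XML::SAX and at least one SAX parser are installed; the NameCollector handler and the inline XML are hypothetical and exist only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::SAX;

# Hypothetical handler that records each element name it sees.
package NameCollector;
use base qw(XML::SAX::Base);

sub start_element {
    my ($self, $element) = @_;
    push @{ $self->{names} }, $element->{Name};
}

package main;

my $xml_text = '<customer><first_name>Frank</first_name></customer>';

# Convenience method: pass the XML text directly.
my $handler_a = NameCollector->new;
XML::SAX::ParserFactory->parser(Handler => $handler_a)
    ->parse_string($xml_text);

# Generic method: describe the source through name => value options.
my $handler_b = NameCollector->new;
XML::SAX::ParserFactory->parser(Handler => $handler_b)
    ->parse(Source => { String => $xml_text });

print "parse_string saw: @{ $handler_a->{names} }\n";
print "parse saw:        @{ $handler_b->{names} }\n";
```

Both calls should report the same element names; which method you choose is mostly a matter of where the XML comes from.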

SAX2 Handler Interfaces

The handler that you create will need to implement code to handle events as they are passed in by the parser. XML::SAX provides access to events to ensure that the SAX2 specification is met. XML::SAX and related parsers also work with namespaces.

Logically, XML::SAX events and handlers can be grouped into categories. Many of the more common handlers fall into the category of content handlers. Content handlers work with the actual content of the document itself, and so content handlers are where you’ll spend a large amount of time coding. Another important category of handlers includes the error handlers that enable you to create custom error handling code. Other handlers include lexical handlers that work with CDATA sections, comments, DTDs, and entities. The following sections look at content event handlers and error event handlers.


Content Event Handlers

The following are some of the event handlers that you’ll encounter when working through the content of the XML:

start_document(document): The start_document() method is sent with an empty document parameter.

end_document(document): The end_document() method can be called at the end of the XML input or when an error occurs. Whatever this method returns—whether it’s an error condition or normal condition—it will be used as the return value by the parse() method.

start_element(element): The start_element() method is called when a new start tag is found. The method is passed a hash parameter, element, containing the following:

Name: The name of the element, including any namespace prefix.

Attributes: Contains a hash of attributes, if any. Be careful not to confuse the attributes with other XML data. Each attribute is itself a hash containing Name (the full name, including prefix), Value (the value of the attribute, trimmed to remove spaces), NamespaceURI (the URI of the namespace), Prefix (the portion of the full name before the local portion), and LocalName (the portion of the full name that is local).

NamespaceURI: The namespace for the element.

Prefix: The prefix for the name.

LocalName: The portion of the full name that is local.

end_element(element): The end_element() method is called when a new end tag is found. The method is passed a hash parameter, element, containing the following:

Name: The name of the element, including any namespace prefix.

NamespaceURI: The URI for the namespace.

Prefix: The portion of the full name before the local portion.

LocalName: The portion of the full name that is local.

characters(data): The characters() method is used for any character data (plain text) in the XML. This method is most frequently used to obtain the actual values contained within the XML, but there’s no guarantee that this method will be called for only those values. In other words, it could be called for any other character data encountered in the XML, including the whitespace between elements. The data parameter is a hash whose Data key contains the string of characters.

Other content handlers include processing_instruction, skipped_entity, ignorable_whitespace, and set_document_locator, among others. See the perldoc on XML::SAX for more information about these and other content event handlers that you may want to code into your handler.
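To tie these content events together, here is a hedged sketch of a handler that buffers character data between start_element() and end_element(), and then hands its results back through end_document(). The FieldCollector package, its internal keys, and the inline XML are all hypothetical; the sketch assumes XML::SAX and a SAX parser are installed, and that the parser returns the value of end_document() from the parse call as described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::SAX;

# Hypothetical handler: gathers the text of each element into a hash.
package FieldCollector;
use base qw(XML::SAX::Base);

sub start_document {
    my ($self, $document) = @_;
    $self->{fields} = {};
    $self->{buffer} = '';
}

sub start_element {
    my ($self, $element) = @_;
    $self->{buffer} = '';                # start fresh for every new element
}

sub characters {
    my ($self, $chars) = @_;
    $self->{buffer} .= $chars->{Data};   # text may arrive in several pieces
}

sub end_element {
    my ($self, $element) = @_;
    my $text = $self->{buffer};
    $text =~ s/^\s+|\s+$//g;             # drop surrounding whitespace
    $self->{fields}{ $element->{LocalName} } = $text if length $text;
    $self->{buffer} = '';
}

sub end_document {
    my ($self, $document) = @_;
    return $self->{fields};              # handed back by the parse call
}

package main;

my $xml_text = <<'END_XML';
<customer>
  <first_name>Frank</first_name>
  <dob>3/10</dob>
</customer>
END_XML

my $parser = XML::SAX::ParserFactory->parser(Handler => FieldCollector->new);
my $fields = $parser->parse_string($xml_text);

print "$_ = $fields->{$_}\n" for sort keys %$fields;
```

Buffering in characters() and storing the value in end_element() guards against text being delivered in more than one call, and trimming keeps the whitespace between elements out of the results.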


Error Event Handlers

Three error event handlers exist:

warning(): A warning is an error that doesn’t stop the parser but is notable nonetheless. You can choose to forgo implementing this handler, in which case the parser will simply ignore the warning.

error(): An error is a serious event, but the parser will continue. An invalid XML document is an example of an error.

fatal_error(): The most serious of the error events, as the name implies, this type of event can cause processing of XML to stop, though the parser may choose to continue.

Each handler accepts a single argument: the exception.
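In day-to-day code, a fatal error usually reaches your program as an exception raised from the parse call, so it can be trapped with eval. The following is a minimal sketch under that assumption, with XML::SAX and a SAX parser installed; the malformed XML string is invented for the example, and a bare XML::SAX::Base object stands in for a real handler.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::SAX;
use XML::SAX::Base;

# Hypothetical malformed input: <first_name> is never closed.
my $bad_xml = '<customer><first_name>Frank</customer>';

my $parser = XML::SAX::ParserFactory->parser(Handler => XML::SAX::Base->new);

# Wrap the parse in eval so a fatal error does not end the program.
my $ok    = eval { $parser->parse_string($bad_xml); 1 };
my $error = $ok ? '' : "$@";

print $ok ? "Parsed without errors\n" : "Parse failed: $error\n";
```

The exact wording of the error message depends on which parser is installed, so it is best treated as diagnostic text rather than something to match against.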

A Basic Parser and Handler

Now that you have an idea of the theory of parsing XML with XML::SAX, as well as a look at some of the more important events to be implemented by your handler, it’s time to get busy with coding your first parser. This section will show you how to code a parsing routine using XML::SAX by creating your own handler for some content events. First, recall the XML from earlier in the chapter. This XML will be used throughout this section:

<?xml version="1.0"?>
<customer-data>
  <customer>
    <first_name>Frank</first_name>
    <last_name>Sanbeans</last_name>
    <dob>3/10</dob>
    <email>frank@example.com</email>
    <vehicle>Volvo S60</vehicle>
    <vehicle>Honda Accord</vehicle>
  </customer>
  <customer>
    <first_name>Sandy</first_name>
    <last_name>Sanbeans</last_name>
    <dob>4/15</dob>
    <email>sandy@example.com</email>
    <vehicle>McLaren MP4-20</vehicle>
    <vehicle>Chevrolet S-10</vehicle>
  </customer>
</customer-data>

Two elements are involved in parsing with XML::SAX: the parser, or main program code, and the handler code. The first task is to create the main program code, which will import XML::SAX into the namespace and set up the parser, as well as perform any other functions that you might want the program to perform outside the XML-specific items. Then you write the handler code, which will largely be specific to the XML being parsed.


Coding the Main Program

You use a use statement to import XML::SAX into the namespace. However, an additional use statement is also necessary in order to import your yet-to-be-coded handler package.

use XML::SAX;
use MyHandler;
use strict;

Since this handler package is something that you will create, you can name it as you wish (assuming a valid name, of course). Don’t fret the details of the handler package yet; you’ll be coding it shortly.

Next, create a parser object and pass it a reference to the handler that you’ll be creating:

my $parser = XML::SAX::ParserFactory->parser( Handler => MyHandler->new);

One of the parser methods is called next. For this example, assume that the XML is stored in a file called example1.xml in the current directory:

$parser->parse_uri("example1.xml");

That’s all there is to the code for the main program. Save the main program as xml-custom.pl and make it executable (chmod 700 xml-custom.pl).

This code is rather simple. The key to the code is within the handler package, MyHandler, which you’ll create as a separate file. This program, as it stands now, will produce an error if you attempt to execute it, since the handler package doesn’t exist yet.

Creating the Handler Package

With the main program coded, the parser will be invoked and will attempt to pass events to the specified handler. Create a separate file, called MyHandler.pm, to hold the code for the custom handler. The handler package is coded as shown in Listing 9-1.

Listing 9-1. A Handler for Parsing XML

package MyHandler;

use base qw(XML::SAX::Base);

sub start_document {
    my $self = shift;
    my $document = shift;
}

sub start_element {
    my $self = shift;
    my $element = shift;
    print $element->{LocalName}, " = ";
}

sub characters {
    my $self = shift;
    my $char = shift;
    print $char->{Data};
}

1;

The first task within the code is to declare it as a package:

package MyHandler;

From there, the XML::SAX::Base methods are imported into the namespace. This enables the handler package to take advantage of the XML::SAX framework:

use base qw(XML::SAX::Base);

Three subroutines follow: start_document(), start_element(), and characters(). Each of these is invoked by the parser as it works through the XML input. The start_document() routine is mostly a placeholder in this application. start_element() is important for parsing XML data. In this simple example, however, it does nothing more than print the name of each XML element, such as customer-data, customer, first_name, last_name, and so on, followed by an equal sign in the output. characters() is where the actual data of the XML is printed as output.

Running the Parser

You’ve created both the main program and the custom handler package. You can now run the xml-custom.pl program to parse the XML. It should produce this output:

customer-data = customer =
first_name = Frank
last_name = Sanbeans
dob = 3/10
email = frank@example.com
customer =
first_name = Sandy
last_name = Sanbeans
dob = 4/15
email = sandy@example.com

A common error when parsing XML is to have XML that is not well-formed. A quick way to check is to read the XML with XML::Simple and print the result with the Data::Dumper package, as shown earlier in the chapter; XML that is not well-formed will cause the parse to fail. Additionally, make sure that the names are correct for the handler package that you created. For example, importing a misnamed handler package or calling it incorrectly in the main program or from within the handler code itself will cause errors.


Including Attributes

Attributes are important within XML parlance. They are parsed as part of the start_element() subroutine, and their use may not be obvious at first glance. This is because, as you’ll recall, the attributes are sent as a hash within the element hash. Here’s the sample XML from earlier, this time including an attribute to indicate that the email element is a required field:

<?xml version="1.0"?>
<customer-data>
  <customer required="email">
    <first_name>Frank</first_name>
    <last_name>Sanbeans</last_name>
    <dob>3/10</dob>
    <email>frank@example.com</email>
    <vehicle>Volvo S60</vehicle>
    <vehicle>Honda Accord</vehicle>
  </customer>
  <customer required="email">
    <first_name>Sandy</first_name>
    <last_name>Sanbeans</last_name>
    <dob>4/15</dob>
    <email>sandy@example.com</email>
    <vehicle>McLaren MP4-20</vehicle>
    <vehicle>Chevrolet S-10</vehicle>
  </customer>
</customer-data>

For brevity’s sake, consider the code for the handler package MyHandler to be the same, with the exception of the start_element() subroutine, which now looks like this:

sub start_element {
    my $self = shift;
    my $element = shift;

    foreach my $key (keys %{ $element->{Attributes} }) {
        my $attrib = $element->{Attributes}->{$key};
        print $attrib->{Name}, " = ", $attrib->{Value}, "\n";
    }

    print $element->{LocalName}, " = ";
}

Again, the remainder of the code for MyHandler.pm (Listing 9-1) is exactly the same as the earlier example. Now there’s a foreach loop to iterate through the keys to the attributes hash and then print them. The output is as follows. Notice the addition of the required = email line:

customer-data = required = email
customer =
first_name = Frank
last_name = Sanbeans
dob = 3/10
email = frank@example.com
vehicle = Volvo S60
vehicle = Honda Accord
required = email
customer =
first_name = Sandy
last_name = Sanbeans
dob = 4/15
email = sandy@example.com
vehicle = McLaren MP4-20
vehicle = Chevrolet S-10
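When you need one particular attribute rather than all of them, it helps to know that the keys of the Attributes hash are typically not the bare attribute names; Perl SAX parsers generally write them in a {NamespaceURI}LocalName form. A way to sidestep that detail is to re-index the attribute entries by LocalName, as in the sketch below. The RequiredFinder package and the inline XML are hypothetical, and the sketch assumes XML::SAX and a SAX parser are installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::SAX;

# Hypothetical handler: remembers which field each element marks as required.
package RequiredFinder;
use base qw(XML::SAX::Base);

sub start_element {
    my ($self, $element) = @_;

    # Re-index the attribute entries by their local names.
    my %value_of = map { $_->{LocalName} => $_->{Value} }
                   values %{ $element->{Attributes} || {} };

    push @{ $self->{required} }, $value_of{required}
        if exists $value_of{required};
}

package main;

my $xml_text = '<customer-data>'
             . '<customer required="email"><dob>3/10</dob></customer>'
             . '</customer-data>';

my $handler = RequiredFinder->new;
XML::SAX::ParserFactory->parser(Handler => $handler)
    ->parse_string($xml_text);

print "Required field: $_\n" for @{ $handler->{required} || [] };
```

Looking attributes up by LocalName keeps the handler independent of how a given parser spells the hash keys.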

You’ve now seen how to parse XML using XML::SAX by creating your own handler for XML::SAX parser events. However, the examples here have only scratched the surface of XML parsing with XML::SAX. SAX is a powerful specification, and XML::SAX is a capable implementation of it for Perl. I invite you to spend some time reading the XML::SAX and XML::SAX::Base documentation and experimenting with the code and with more complex examples to parse XML in Perl with this excellent module.

Using Tree-Based Parsing

The chapter began with a look at XML::Simple for parsing simple XML. You then read about parsing of XML with XML::SAX, a framework around which very complex XML parsing can be done. Tree-based parsing, or simply tree parsing, is yet another process for parsing XML. This method delivers the entire XML structure to your program as one logical entity, as opposed to the delivery in chunks that you get with a stream processor. As noted earlier in the chapter, tree parsers are almost always stream-based parsers at heart, but they hold the data until the end of the parsing.

Needing to pass the entire structure at once almost always means that tree parsers have higher memory requirements than their stream-based counterparts. Since XML structures can be quite complex, it’s not uncommon to receive an Out of Memory error when using a tree parser on complex and/or lengthy XML.

Tree parsers include XML::Parser, which can be used both as a tree and a stream parser, XML::Grove, XML::TreeBuilder, XML::Twig, and XML::SimpleObject, just to name a few. The parser that you saw earlier in the chapter, XML::Simple, is yet another tree parser.

Each XML parser has its own features and invariably its own syntax as well. XML::Twig, for example, is interesting in that it can hold part of the XML tree, thus saving memory. XML::Twig also provides a simple means for converting XML into HTML or into other formats. The following example prints XML using the indented option with XML::Twig:

#!/usr/bin/perl

use strict;
use XML::Twig;