
Beginning Perl Web Development - From Novice To Professional (2006)


CHAPTER 8 PERL AND RSS

$rsswriter->channel(
    title       => "My Watch Summary",
    link        => "http://www.braingia.org/",
    description => "Weather Watches for KS, IA, MN, and WI"
);

As in the previous examples, you iterate over each item of the incoming RSS feed. Instead of printing all of the items to STDOUT, this time, each one is examined to see if the description contains one of the four states that you’re interested in for this example. If one of those states is listed within the incoming item’s description, the add_item() method is called on the $rsswriter object:

foreach my $item (@{$rss->{'items'}}) {
    if ($item->{'description'} =~ /KS|WI|IA|MN/) {
        $rsswriter->add_item(
            title       => $item->{'title'},
            description => $item->{'description'},
            link        => $item->{'link'}
        );
    }
}
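One caveat about the pattern above: /KS|WI|IA|MN/ matches those letter pairs anywhere in the description, so a word such as WIND would also match. The following is a minimal sketch of a stricter test, assuming the state abbreviations appear as separate uppercase words in the description. The sub name mentions_state and the second sample string are made up for illustration and are not part of the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true only when one of the state codes
# appears as a whole word in the description.
sub mentions_state {
    my ($description) = @_;
    return $description =~ /\b(?:KS|WI|IA|MN)\b/ ? 1 : 0;
}

# Sample descriptions: the first comes from the chapter's output,
# the second is invented to show the difference.
my @samples = (
    "WW 587 SEVERE TSTM KS NE 031245Z - 031800Z",
    "WW 588 DAMAGING WIND TX OK",
);

foreach my $description (@samples) {
    my $verdict = mentions_state($description) ? "keep" : "skip";
    print "$verdict: $description\n";
}
```

The \b anchors require a word boundary on each side of the abbreviation, so the second sample is skipped even though it contains the letters WI.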

Once each item in the incoming feed has been examined, it’s time to write the RSS feed. You can do this by using the save() method or by printing the feed with the as_string() method. I chose to save the RSS feed to a file called mywatchsummary.xml:

$rsswriter->save("mywatchsummary.xml");

If you would rather print the RSS to STDOUT, use a print statement with the as_string() method:

print $rsswriter->as_string;

Regardless of which method you use, the resulting file or output looks like this:

<?xml version="1.0" encoding="UTF-8"?>

<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://purl.org/rss/1.0/"
 xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:syn="http://purl.org/rss/1.0/modules/syndication/"
 xmlns:admin="http://webns.net/mvcb/"
>

<channel rdf:about="http://www.braingia.org/">
<title>My Watch Summary</title>
<link>http://www.braingia.org/</link>
<description>Weather Watches for KS, IA, MN, and WI</description>
<items>
 <rdf:Seq>
  <rdf:li rdf:resource="http://www.spc.noaa.gov/products/watch/ww0587.html" />
 </rdf:Seq>
</items>
</channel>

<item rdf:about="http://www.spc.noaa.gov/products/watch/ww0587.html">
<title>SPC Severe Thunderstorm Watch 587</title>
<link>http://www.spc.noaa.gov/products/watch/ww0587.html</link>
<description>WW 587 SEVERE TSTM KS NE 031245Z - 031800Z</description>
</item>

</rdf:RDF>

You can also use the output attribute to convert between RSS versions. The output shown previously is RSS version 1.0, but by setting the output attribute of the XML::RSS object, $rsswriter, you can change this to a different version on the fly. For example, the code to change the version just prior to printing the output looks like this:

$rsswriter->{'output'} = '0.91';
print $rsswriter->as_string;

The resulting output would show the change:

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">

<channel>
<title>My Watch Summary</title>
<link>http://www.braingia.org/</link>
<description>Weather Watches for KS, IA, MN, and WI</description>

<item>
<title>SPC Severe Thunderstorm Watch 587</title>
<link>http://www.spc.noaa.gov/products/watch/ww0587.html</link>
<description>WW 587 SEVERE TSTM KS NE 031245Z - 031800Z</description>
</item>

</channel>
</rss>

The final program is shown in Listing 8-3.


Listing 8-3. Retrieving Weather Watches with RSS

#!/usr/bin/perl

use strict;
use warnings;
use XML::RSS;
use LWP::Simple;

# Despite its name, $url holds the content of the feed.
# get() returns undef if the retrieval fails.
my $url = get("http://www.spc.noaa.gov/products/spcwwrss.xml");
die "Could not retrieve the RSS feed\n" unless defined $url;

my $rss = XML::RSS->new;
$rss->parse($url);

my $rsswriter = XML::RSS->new;

$rsswriter->channel(
    title       => "My Watch Summary",
    link        => "http://www.braingia.org/",
    description => "Weather Watches for KS, IA, MN, and WI"
);

# Copy only the items whose description mentions one of the four states.
foreach my $item (@{$rss->{'items'}}) {
    if ($item->{'description'} =~ /KS|WI|IA|MN/) {
        $rsswriter->add_item(
            title       => $item->{'title'},
            description => $item->{'description'},
            link        => $item->{'link'}
        );
    }
}

$rsswriter->save("mywatchsummary.xml");

The XML::RSS module contains other methods and attributes that you may find helpful for your own RSS writing projects. Look over the perldoc for XML::RSS for more information about these methods and attributes.

Security Considerations with RSS

Creation of your own RSS feeds doesn’t pose any great security risk in itself. Of course, you don’t want to release any sensitive information through an RSS feed, just as you wouldn’t want to allow access to certain data through a web page or CGI program.

Consuming an RSS feed carries with it the same risks inherent in any external data source. You should be sure that all data external to your program is safe for use within the program. If an RSS feed contains malicious data, using it within your program puts the program and the system at risk.
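As one illustration of treating feed data as untrusted, the sketch below escapes HTML metacharacters in an item’s title before printing it into a web page. The sub name escape_html and the sample item are assumptions made for this example. In production code, a maintained module such as HTML::Entities is likely a more thorough choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace the characters that are special in HTML
# with their entity equivalents so feed text cannot inject markup.
sub escape_html {
    my ($text) = @_;
    return '' unless defined $text;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&#39;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Made-up item standing in for one entry of a parsed feed.
my $item = { title => '<script>alert("x")</script> Watch & Warning' };

print "<li>", escape_html($item->{title}), "</li>\n";
```

Escaping at the point of output, rather than when the feed is read, keeps the stored data unchanged and applies the protection that matches the place where the data is used.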


Summary

This chapter dealt with RSS feeds through Perl, covering both creation and consumption of RSS feeds. Specifically, you saw how to parse and write RSS using the XML::RSS module. Other modules for parsing and writing RSS with Perl, such as XML::RSS::Feed, are available.

When parsing an RSS feed, you create a new RSS object to parse an RSS file. Retrieval of the RSS file is left for another module. In this chapter, you saw how to use LWP::Simple to retrieve an RSS feed from an Internet site, but you can use any means to get an RSS feed into the parser, including using a local file. A local file is recommended when developing and debugging the RSS feed, so that the site operator doesn’t misinterpret the repeated retrieval requests.

Many of the same methods are used for both parsing and writing an RSS feed. You can choose and change the version for writing RSS by specifying it at instantiation time or with the output attribute.

This and the previous chapter have both touched on XML-related services in one form or another and provided a good introduction to XML applications in the real world. In the next chapter, you’ll finally look at straight XML parsing with Perl.

CHAPTER 9

XML Parsing with Perl

You have some data in XML. Maybe that data is from a SOAP web service, maybe it’s from an RSS feed, or maybe it’s from another source. Now you want to read the XML and extract the data from it. As is the theme with Perl, you have multiple ways to accomplish this task.

XML parsing with Perl has a storied history. Some early modules were quirky, while others were incomplete.

Parsing simple XML with Perl is, well, simple. Parsing complex XML with Perl can be quite difficult. The important thing to remember is that XML is just a way to represent data. That data happens to be in an XML document. The program that you write to parse XML will first need to read the XML, and then use the results as it would any other data input.

This chapter looks at XML parsing with Perl. It first reviews the main parsing methods, and then describes using two modules: XML::Simple and XML::SAX. Finally, it examines tree-based parsing.

XML Parsing Methods

Recall that there’s always more than one way to do the same thing with Perl. XML parsing is no different. And, of course, there’s no rule that says that you must use an XML parser at all. It’s quite possible for you to write your own XML parser, just as it would be possible to write your own module for anything in Perl, rather than using an already existing module.

Primarily, two methods exist for parsing an XML document:

Stream parsing: Stream-based parsers process XML as it is read into the parser. As each new element (called a token) is encountered, it is processed by the parser and passed to your program as an event. This means that the program must process each piece of data as the stream-based parser encounters it. Stream-based parsers have lower memory requirements than their tree-based counterparts, simply because they don’t store any data; rather, they send data along into the rest of the program as it is found. Of course, the lower memory requirements are gained at the expense of complexity when compared to tree-based parsers.

Tree parsing: Tree-based parsers load entire XML structures into memory for later processing. This means that the entire document is parsed prior to your program needing to handle it. In turn, this leads to less complex programs when compared to stream-based parsing. The extra simplicity comes at the cost of higher memory requirements. Naturally, on a modern computer with a small document to parse, the memory required will be minimal.


At their core, all XML parsers are stream parsers. It’s just that some build a tree structure on top of the stream automatically for you. XML::Parser is an example of an early (although still useful) stream parser in Perl. XML::SAX, or the Simple API for XML, is another stream-based implementation, which will be covered later in this chapter.
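The relationship between the two styles can be sketched with a toy example. The code below is not a real XML parser and does not use any of the modules named in this chapter: stream_parse is a hypothetical regex-based tokenizer that handles only bare elements and text (no attributes, declarations, comments, or entities). It shows one consumer that reacts to events without storing anything, and a second that builds a tree on top of the same events.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy event source (an assumption for illustration only): walks a string
# of very simple XML and calls a handler for each token it finds.
sub stream_parse {
    my ($xml, $handlers) = @_;
    while ($xml =~ m{<(/?)([\w:-]+)>|([^<]+)}g) {
        my ($closing, $name, $text) = ($1, $2, $3);
        if (defined $name) {
            my $event = $closing ? 'end' : 'start';
            $handlers->{$event}->($name);
        }
        elsif ($text =~ /\S/) {
            $handlers->{text}->($text);
        }
    }
    return;
}

my $xml = '<customer><first_name>Frank</first_name><dob>3/10</dob></customer>';

# Stream style: react to each event and keep nothing but a counter.
my $elements = 0;
stream_parse($xml, {
    start => sub { $elements++ },
    end   => sub { },
    text  => sub { },
});
print "Elements seen: $elements\n";

# Tree style: the same events, but the handlers build a structure in memory.
sub build_tree {
    my ($source) = @_;
    my $root  = { name => '#root', children => [], text => '' };
    my @stack = ($root);
    stream_parse($source, {
        start => sub {
            my $node = { name => $_[0], children => [], text => '' };
            push @{ $stack[-1]{children} }, $node;
            push @stack, $node;
        },
        end  => sub { pop @stack },
        text => sub { $stack[-1]{text} .= $_[0] },
    });
    return $root->{children}[0];
}

my $tree = build_tree($xml);
print "Root: $tree->{name}\n";
foreach my $child (@{ $tree->{children} }) {
    print "  $child->{name} = $child->{text}\n";
}
```

The stream consumer uses a fixed amount of memory no matter how large the input is, while the tree consumer holds every node until the program is finished with it. That is the memory-versus-simplicity trade-off described above.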

XML Parsing Considerations

The following are some important general reminders and caveats for working with XML in Perl:

When you build a program to parse XML, it really exists only for that XML. This means that the program to parse the XML will invariably be largely one-time-use code.

There are many Perl modules for parsing XML and assisting with XML work. Regardless of which module you use, it is expected that the XML used as input will be well-formed. The XML parsing modules will likely produce wonky results—if they produce any results at all—when presented with poorly formed XML.

Not all XML modules can handle all aspects of XML such as namespaces, entities, and declarations, or at best, they don’t all handle those objects the same way. It’s important to make sure the output is correct, rather than just looks correct. A small and subtle change to the XML input could break the program if a module is being used incorrectly.

Spacing and character encodings are important items to consider when parsing XML. It’s possible for white space or unfamiliar or unexpected character encodings to cause unexpected results.

Parsing XML with XML::Simple

XML::Simple is an example of a tree-based XML parser, which is, well, simpler to use than other XML parsers. XML::Simple has just two subroutines: XMLin() and XMLout(). XMLin() is used to read an XML structure into an in-memory hash. The source of this XML is usually a string or file. From the XMLin() subroutine comes a reference to a hash. XMLout() creates XML when passed a reference to a hash that contains an encoded document.

Consider this bit of XML:

<?xml version="1.0"?>
<customer-data>
  <customer>
    <first_name>Frank</first_name>
    <last_name>Sanbeans</last_name>
    <dob>3/10</dob>
    <email>frank@example.com</email>
  </customer>
  <customer>
    <first_name>Sandy</first_name>
    <last_name>Sanbeans</last_name>
    <dob>4/15</dob>
    <email>sandy@example.com</email>
  </customer>
</customer-data>


This XML is saved in a file titled example1.xml. The code to parse this XML structure is as follows:

#!/usr/bin/perl

use strict;
use XML::Simple;

my $xml = XMLin('./example1.xml', forcearray => 1);

foreach my $customer (@{$xml->{customer}}) {
    print "Name: $customer->{first_name}->[0] ";
    print "$customer->{last_name}->[0]\n";
    print "Birthday: $customer->{dob}->[0]\n";
    print "E-mail Address: $customer->{email}->[0]\n";
}

The code begins with the familiar use strict pragma, and then imports XML::Simple into the namespace:

use XML::Simple;

The XMLin() subroutine is called with the name of the file and the forcearray option. XMLin() returns a reference to a hash. The forcearray => 1 option makes XMLin() represent every nested element as an array reference, even when the element occurs only once:

my $xml = XMLin('./example1.xml',forcearray => 1);

Next, the array reference stored under the customer key of that hash is dereferenced so that the loop can visit each customer in turn. In the sample XML, each customer element contains a number of child elements. Because forcearray is enabled, each child is itself an array, so its value is read from index [0] and printed within the foreach loop:

foreach my $customer (@{$xml->{customer}}) {
    print "Name: $customer->{first_name}->[0] ";
    print "$customer->{last_name}->[0]\n";
    print "Birthday: $customer->{dob}->[0]\n";
    print "E-mail Address: $customer->{email}->[0]\n";
}

This program is rather simple and does nothing more than print out each element listed for both customers in the file. Obviously, you could expand this to perform additional functions within the foreach loop. The output looks like this:

Name: Frank Sanbeans

Birthday: 3/10

E-mail Address: frank@example.com

Name: Sandy Sanbeans

Birthday: 4/15

E-mail Address: sandy@example.com


The code shown uses the forcearray option, which isn’t really necessary here. The XML being parsed in this example consists solely of single values: each customer record has one and only one value for date of birth, e-mail address, and so on. Another way to parse this particular XML looks like this:

#!/usr/bin/perl

use strict;
use XML::Simple;

my $xml = XMLin('./example1.xml');

foreach my $customer (@{$xml->{customer}}) {
    print "Name: $customer->{first_name} $customer->{last_name}\n";
    print "Birthday: $customer->{dob}\n";
    print "E-mail Address: $customer->{email}\n";
}

The difference between this and the previously shown code is subtle. Missing from this example is the reference to the first element of each array, ->[0]. When you parse XML with multivalued elements, enabling forcearray makes those elements much easier to access, as you’ll see a bit later in the “XML::Simple Options” section.

Data::Dumper

An even simpler, though arguably less useful, way to examine an XML file with XML::Simple is to use Data::Dumper. You can use the Data::Dumper module to quickly print the data structure that the XMLin() subroutine builds from the XML. Doing so helps during debugging and in other cases, such as working with databases. Rather than using the foreach loop in the previous example, you could use Data::Dumper to print the contents of the XML, as shown in this example:

#!/usr/bin/perl

use strict;
use XML::Simple;
use Data::Dumper;

my $xml = XMLin('./example1.xml', forcearray => 1);

print Dumper($xml);

Compare the output from the earlier example with the output from the Data::Dumper version of the program:

$VAR1 = {
    'customer' => [
        {
            'email' => [
                'frank@example.com'
            ],
            'dob' => [
                '3/10'
            ],
            'last_name' => [
                'Sanbeans'
            ],
            'first_name' => [
                'Frank'
            ]
        },
        {
            'email' => [
                'sandy@example.com'
            ],
            'dob' => [
                '4/15'
            ],
            'last_name' => [
                'Sanbeans'
            ],
            'first_name' => [
                'Sandy'
            ]
        }
    ]
};

Notice that each element in the Data::Dumper version is placed into its own array. This was the result of enabling the forcearray option. Contrast that output with a call to XMLin() with forcearray disabled:

$VAR1 = {
    'customer' => [
        {
            'email' => 'frank@example.com',
            'dob' => '3/10',
            'last_name' => 'Sanbeans',
            'first_name' => 'Frank'
        },
        {
            'email' => 'sandy@example.com',
            'dob' => '4/15',
            'last_name' => 'Sanbeans',
            'first_name' => 'Sandy'
        }
    ]
};
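The two dumps show why code that reads XML::Simple output has to know which shape to expect: with forcearray enabled every child is an array reference, and with it disabled a child that occurs once is a plain scalar. The following is a minimal sketch of a defensive accessor that copes with either shape. It uses hand-written data structures that mirror the dumps above rather than a real call to XMLin(), and the sub name values_of is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the value(s) of a field as a list,
# whether the field is a plain scalar or an array reference.
sub values_of {
    my ($value) = @_;
    return () unless defined $value;
    return ref $value eq 'ARRAY' ? @{$value} : ($value);
}

# Hand-written stand-ins for one customer record in each dump.
my $with_forcearray    = { email => ['frank@example.com'], dob => ['3/10'] };
my $without_forcearray = { email => 'frank@example.com',   dob => '3/10'   };

foreach my $customer ($with_forcearray, $without_forcearray) {
    my ($email) = values_of($customer->{email});
    print "E-mail Address: $email\n";
}
```

Both records print the same line. In practice, choosing one forcearray setting and using it consistently is usually simpler than normalizing the data afterward.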