
Beginning Perl Web Development - From Novice To Professional (2006)


CHAPTER 5: LWP MODULES

you have a browser object, $browser, already created. The following code would create a duplicate of that browser object:

$browser2 = $browser->clone();

Submitting a Web Form

Two HTTP methods are used to pass form variables into a script on the web: GET and POST. Using GET, the parameters are passed as part of the URL itself in name=value pairs. This type of submission using the LWP is rather trivial and can be accomplished in a number of ways through various GET methods, as you’ve already seen in this chapter.

However, even though GET is the most commonly used method, the POST method is also frequently used, especially when working with web forms or web services.

Using GET, any parameters passed into a CGI application are passed via the URL. This can be problematic for three main reasons:

Some browsers and servers limit the length of the URL, which makes complicated parameter passing more difficult.

All characters in the URL must be encoded in order to be safe for URLs.

Parameters passed on the URL are easily exposed. They end up in browser history, in server and proxy logs, and in Referer headers. Over plain HTTP, they are also visible to anyone listening on the network. SSL (HTTPS) encrypts the URL in transit, but it does not keep the URL out of those logs.

In contrast, using POST, all of the parameters are passed as part of the message body. This eases all three problems with GET. The protocol itself sets no limit on the size of the body, although a server may impose its own. Form data in the body is still URL-encoded, but the LWP post() method takes care of that encoding for you. And since the parameters are passed within the body, they stay out of URLs and logs, and they are encrypted when passed over SSL.
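To make the encoding of GET parameters concrete, here is a minimal sketch that builds a GET URL with the URI module, which is installed alongside the LWP. The form URL and field values are the illustrative ones used in this section, and nothing is sent over the network.

```perl
use strict;
use warnings;
use URI;

# Build a GET URL; query_form() URL-encodes each name and value.
my $uri = URI->new('http://www.example.com/form.cgi');
$uri->query_form(
    name  => 'Steve Suehring',
    email => 'suehring@braingia.com',
    zip   => '54481',
);

# The encoded parameters are now part of the URL itself.
print "$uri\n";
```

Printing the URL shows why GET parameters are so visible: everything the form carries is part of the address that gets requested and logged.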

Using the LWP post() method, the name=value pairs are passed as an array—well, actually as a reference to an array, as you’ll see shortly.

When working with forms, there are a number of form elements that appear inside the <form></form> tags on the page. For example, assume a web form located at http://www.example.com/form.cgi contains text boxes to fill in with information such as the user’s name, e-mail address, and zip code. The name=value parameters might look like this for a filled-in form:

name=Steve Suehring
email=suehring@braingia.com
zip=54481

You can send these in a POST request through the LWP by placing them as arguments within the call to the post() method of the browser object, as shown here:

$result = $browser->post('http://www.example.com/form.cgi',
    [
        'name'  => 'Steve',
        'email' => 'suehring@braingia.com',
        'zip'   => '54481'
    ]);

To analyze a web form, the first task is to determine the URL of the target. This is defined in the action attribute of the <form> tag. From there, it’s a matter of determining which parameters, if any, are required, and the corresponding values for them. Of course, since this is Perl, it’s common to substitute variables for the parameter values themselves. So instead of hard-coding the zip code, you might want to set $zip as a scalar variable that changes for the web form. Naturally, what you do with the POSTed data is up to you and the form itself.
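As a rough sketch of substituting a variable for a parameter value, the following uses HTTP::Request::Common (part of the LWP distribution) to build the POST request without sending it, so you can inspect exactly what would go to the server. The form URL and field values are the illustrative ones from this section, and $zip stands in for a value your program would obtain elsewhere.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

my $zip = '54481';    # would normally come from user input or a data file

# POST() builds an HTTP::Request object; nothing is sent yet.
my $request = POST(
    'http://www.example.com/form.cgi',
    [
        name  => 'Steve Suehring',
        email => 'suehring@braingia.com',
        zip   => $zip,
    ]
);

print $request->method, ' ', $request->uri, "\n";
print $request->header('Content-Type'), "\n";
print $request->content, "\n";    # the encoded name=value pairs

# To submit it with an existing browser object:
# my $result = $browser->request($request);
```

Building the request separately from sending it is a convenient way to check the encoded body while you work out which parameters a form expects.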

Handling Cookies

As explained in Chapter 1, cookies are used by web sites to track state and other information about the visiting browser or user agent. It’s up to you to work with the cookies that are set and expected by the web site.

The LWP’s cookie_jar attribute is used with sites that set and read browser cookies. Using the cookie_jar attribute, you can store cookies either in memory or in a file. When I monitored a site to win a gaming console (as I described earlier in the chapter), I used an existing cookie store. Since that site required authentication using a cookie, I was able to use the cookies file from Firefox to successfully authenticate to the site from within the script.

The cookie_jar attribute can read cookies based solely in memory, or it can use cookies in a file. If the cookies are based in memory, they exist only as long as the life of the user agent object created within the program itself. If the cookies are based in a file, they become persistent and can be saved and read between multiple user agents and multiple executions of the program itself.

You can create a temporary cookie store in memory by creating an HTTP::Cookies object with no arguments. For example, assume a browser object named $browser. Creating a memory-based cookie store would look like this:

$browser->cookie_jar(HTTP::Cookies->new);

On the other hand, using a file would look like this:

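use LWP;
use HTTP::Cookies;

my $browser = LWP::UserAgent->new( );
my $cookie_jar = HTTP::Cookies->new(
    'file'     => '/home/suehring/cookies.txt',
    'autosave' => 1,    # write the cookies back to the file when the jar is destroyed
);
$browser->cookie_jar($cookie_jar);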
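The persistence described above can be sketched without contacting a web site. In this example, one cookie is stored by hand, saved to a file, and read back by a second jar, much as a later run of the program would do. The cookie name, value, and domain are made up for illustration, and a temporary directory stands in for a real cookies file.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use HTTP::Cookies;

# Hypothetical location; a temporary directory keeps the sketch self-contained.
my $cookie_file = tempdir(CLEANUP => 1) . '/cookies.txt';

my $jar = HTTP::Cookies->new(file => $cookie_file);

# Store one cookie by hand (normally the LWP does this from server responses).
# Arguments: version, name, value, path, domain, port, path_spec,
#            secure, max-age in seconds, discard
$jar->set_cookie(0, 'session', 'abc123', '/', 'www.example.com',
                 undef, 1, 0, 3600, 0);
$jar->save;    # write the jar out to $cookie_file

# A second jar pointed at the same file loads the saved cookies.
my $later_jar = HTTP::Cookies->new(file => $cookie_file);
print $later_jar->as_string;
```

Cookies marked to be discarded at the end of a session are not written by save() unless the jar was created with the ignore_discard option.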

Handling Password-Protected Sites

Some sites require authentication through a username and password in order to sign in and use the resources found there. The need for authentication is indicated by a 401 HTTP response (Unauthorized, which some servers report as Authorization Required). In a desktop browser, a dialog box prompting for the username and password normally pops up, as opposed to a login web page. The LWP includes support for sites that use basic authentication. Using the credentials() method, you add a username and password to a given browser object programmatically. The credentials() method looks like this:

$browser->credentials('server:port','realm','username'=>'password');

For example, the site www.example.com has a subscribers area for which you must supply credentials. This site uses a realm of Subscribers.


$browser->credentials('www.example.com:80', 'Subscribers',
    'suehring' => 'badpassword');

Now when the $browser object is used to access a URL within the www.example.com domain that prompts for credentials, the credentials specified in the example will be sent.

The credentials themselves die at the end of the browser object’s life. You can store as many credentials inside a browser object as you need, based on the server name and realm name for the protected resource.
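Here is a small sketch of storing more than one set of credentials in a single browser object. The second server, its realm, and its username and password are hypothetical. The get_basic_credentials() method is the lookup the LWP performs when a server answers with a 401, so calling it directly shows which credentials would be sent without making a request.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $browser = LWP::UserAgent->new;

# One entry per server:port and realm; the second site is hypothetical.
$browser->credentials('www.example.com:80', 'Subscribers',
                      'suehring' => 'badpassword');
$browser->credentials('intranet.example.com:80', 'Staff',
                      'steve' => 'anotherpassword');

# Look up what would be sent for a URL in the Subscribers realm.
my ($user, $password) = $browser->get_basic_credentials(
    'Subscribers', URI->new('http://www.example.com/subscribers/'));
print "Would authenticate as: $user\n";
```

Because the lookup is keyed on both the server and the realm, a request to www.example.com that reports a different realm will not receive these credentials.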

Mirroring a Web Site

Earlier in this chapter, I mentioned LWP::Simple’s mirror() function, as well as the lwp-mirror program. Both of these mirror a single document: they copy the content at a URL to a local file and, on later runs, download it again only if it has changed on the server. To mirror an entire web site, you call them once for each URL. The browser object also has a mirror() method that does the same job, while taking advantage of the extra power of the object’s interface.

The lwp-mirror program does an excellent job of mirroring in a sane, easy-to-understand manner. I recommend the lwp-mirror program for nearly all mirroring operations. lwp-mirror is called from your shell and accepts a URL and an output file as arguments:

lwp-mirror <url> <output_file>

Here is an example:

lwp-mirror http://www.braingia.org/ local_braingia_index.html

The mirror() method on the browser object has two requirements as well: the URL and the output file. Here is an example of using this method:

$browser->mirror('http://www.braingia.org','local_braingia_index.html');
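The mirror() method returns a response object, so a program can tell whether the file was actually downloaded. The following is a minimal sketch with a hypothetical helper named mirror_page(). It treats a 304 (Not Modified) response as meaning the local copy is already current.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Mirror one URL to a local file and report what happened.
# mirror() returns an HTTP::Response object.
sub mirror_page {
    my ($browser, $url, $file) = @_;
    my $response = $browser->mirror($url, $file);

    return 'downloaded' if $response->is_success;
    return 'unchanged'  if $response->code == 304;    # Not Modified
    return 'failed: ' . $response->status_line;
}

my $browser = LWP::UserAgent->new;

# Example call (commented out so the sketch makes no network request):
# print mirror_page($browser, 'http://www.braingia.org/',
#                   'local_braingia_index.html'), "\n";
```

Checking the result this way lets a scheduled job stay quiet when nothing has changed and complain only on a real failure.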

Handling Proxies

Proxies are sometimes required to access Internet services. The LWP includes a set of methods for working with proxies that enable you to set a proxy for a given protocol or set of protocols. When a proxy is required on a given system, it’s not uncommon for it to be set among the different environment variables in the shell. The LWP can use the shell environment variable for proxy. A call to the env_proxy() method will look for environment variables that indicate the proxy server to use, such as http_proxy, as in the following example:

$browser->env_proxy();

It doesn’t hurt to call this method if nothing is set for the proxy environment variable—the proxy value for the browser object will still be empty.

The proxy() method accepts two arguments: the protocol and the actual proxy to use. Here is its format:

$browser->proxy(protocol, proxy_server);


For this example, assume that you have a browser object called $browser and proxy server called proxy.example.com. If you want to set the HTTP proxy server for use within the program, the invocation of the proxy() method looks like this:

$browser->proxy("http","http://proxy.example.com");

It’s quite common for a proxy server to be used for URLs that are outside the local network. Inside the network, a proxy server should not be used. For these cases, the LWP includes a no_proxy() method that accepts a comma-separated list of domains for which no proxy server should be used. Assume that you have a server located at local.example.com for which you want direct access, as opposed to access through the proxy. The no_proxy() method call looks like this:

$browser->no_proxy("local.example.com");

Calling no_proxy() with an empty list clears out the list of hosts:

$browser->no_proxy();
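Putting the proxy methods together, a sketch of a typical setup might look like the following. The proxy port (3128) and the second direct-access host are assumptions added for illustration. No request is made; the code only configures the browser object and reads the setting back.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Send HTTP and FTP requests through the proxy (port 3128 is an assumption).
$browser->proxy(['http', 'ftp'], 'http://proxy.example.com:3128');

# Hosts that should be contacted directly, bypassing the proxy.
$browser->no_proxy('local.example.com', 'localhost');

# Called with only a protocol, proxy() returns the current setting.
print "HTTP proxy: ", $browser->proxy('http'), "\n";
```

Passing a reference to an array of protocols sets the same proxy for each of them in one call.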

Removing HTML Tags from a Page

As you’ve undoubtedly seen if you’ve followed the examples in this chapter, the content that comes back from a GET request is the raw, uncensored HTML (and other language) content from the web server. To say that this is difficult for a human to read and interpret is an understatement. Unfortunately, there is no surefire method for extracting the useful text from a web page. However, you have some options for retrieving the text from a page.

For example, Listing 5-6 shows the Get.pl example shown earlier in the chapter, but modified to use HTML::FormatText to produce output that is more human-friendly.

Listing 5-6. Using HTML::FormatText to Retrieve the Text from a Page

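#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;
use HTML::FormatText;
use LWP::Simple;

# Fetch the page and build a parse tree from the HTML
my $webpage = get("http://www.braingia.org/");
my $htmltree = HTML::TreeBuilder->new->parse($webpage);
$htmltree->eof;    # tell the parser that the document is complete

# Render the tree as plain text
my $output = HTML::FormatText->new();
print $output->format($htmltree);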


Recall that when run before, the output from web page retrieval looked like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>

Now, rather than outputting the raw HTML and other bits, with the help of HTML::FormatText, the output is the actual text on the web page, which currently looks like this:

Braingia.org - Web Site for Steve Suehring
==========================================

Home | LinuxWorld Magazine | Google Current Work Software My Bookshelf Older Projects Webnotes Contact

Intarweb

As you can see, the output is much easier for a human to read and parse.

Tip: Using the Lynx web browser with the -dump option also gets the text on the web page.

The easiest (or so it may seem) method for working with the text from a web page is by using regular expressions. Since HTML and other languages are known entities, it’s almost always possible to work up a regular expression to extract the text that you need.

There are also Perl modules to assist with the extraction of text from web pages. The aptly titled HTML::Parser module along with HTML::Tokenizer serve this purpose. These modules can be quite cumbersome to use though, and are highly specialized at that. The WWW::Mechanize Perl module provides a good interface to enable browsing through a Perl program as well.

Both regular expressions and the parsing modules have their limitations. Regardless of which method you choose, each page that you need to parse will be unique and offer its own set of challenges.
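As one illustration of the parsing-module approach, here is a minimal sketch that uses HTML::Parser to collect the text of a page while skipping script and style blocks. The helper name html_to_text() and the sample HTML are made up for this example. A real page will likely need more handling than this.

```perl
use strict;
use warnings;
use HTML::Parser;

# Return the visible text of an HTML string, skipping script and style blocks.
sub html_to_text {
    my ($html) = @_;
    my @chunks;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes each run of text with entities such as &amp; decoded
        text_h => [ sub { push @chunks, shift }, 'dtext' ],
    );
    $parser->unbroken_text(1);                     # do not split text runs
    $parser->ignore_elements(qw(script style));    # drop their content
    $parser->parse($html);
    $parser->eof;

    my $text = join ' ', @chunks;
    $text =~ s/\s+/ /g;         # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;    # trim both ends
    return $text;
}

print html_to_text('<html><head><title>Braingia.org</title></head>'
    . '<body><h1>Home</h1><script>var x = 1;</script>'
    . '<p>Web Site for Steve Suehring</p></body></html>'), "\n";
```

Unlike a regular expression, the parser copes with tags that span lines and with entities, at the cost of a little more setup.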

Security Considerations with the LWP

When working with the LWP, you must take extra caution to not cause unnecessary traffic. It’s quite easy to begin a mirror process and consume a lot of disk space or network bandwidth. In addition, the administrators of the site being mirrored might think the site is under attack and take action accordingly.


Obviously, the same rules that apply to other Perl programming apply when you’re using the LWP. Don’t run as root unless absolutely necessary, be mindful of what you’re doing so you don’t overwrite files, and so on.

If you’re allowing uploads through web forms, pay special attention to where those files are uploaded to and what the user can do with those files once uploaded. Numerous attacks have begun through a file-upload interface.

Summary

This chapter looked at some forms of interaction between a Perl program and the Internet using the LWP modules. You saw how to set up a Perl-based browser, along with attributes such as the user agent. You retrieved web pages and also learned about the GET and POST methods.

More Internet interaction through Perl is on the way in the next chapter. Where this chapter focused primarily on the LWP and web interaction, the next chapter will expand into other protocols, such as POP3, SMTP, and others.

CHAPTER 6

Net:: Tools

The things that a programmer can do with Perl never cease to amaze me. The area of network programming is no exception. Of course, it’s quite possible to get down and dirty with Perl and write your own network servers and clients. I find this to be rather enjoyable, which should tell you something about me. However, sometimes I value simply getting the work done, rather than the process of writing low-level client/server code. Truthfully, that’s most of the time. There’s no need to reinvent the wheel when it comes to working with Simple Mail Transfer Protocol (SMTP), Domain Name System (DNS), Post Office Protocol version 3 (POP3), Internet Control Message Protocol (ICMP), Lightweight Directory Access Protocol (LDAP), and other networking protocols.

This chapter takes a look at some of the tools available to the Perl programmer for working with various Internet protocols (aside from HTTP): POP3 and SMTP for working with e-mail, DNS, and ICMP for ping. These are just a few of the numerous Net:: modules available. For example, other Net:: modules allow you to query an LDAP directory (and interoperate with Microsoft’s Active Directory), query the whois database of domain names, work with FTP, and more. The libnet tools on CPAN (http://search.cpan.org/~/libnet-1.19/) provide a listing of some of these tools.

Checking E-Mail with Net::POP3

POP3 (defined by RFC 1939) is a popular protocol used to check e-mail. It’s used to retrieve e-mail from a server, typically at an Internet provider, where the e-mail is stored or spooled. When you check your e-mail, a username and password are sent to the server, and the server sends back a list of messages and, optionally, the e-mail content itself. POP3 is not a protocol to send e-mail; that’s SMTP. POP3 is used only to retrieve e-mail that’s being stored on a POP3 server.

The Net::POP3 module is the primary module used to check e-mail with the POP3 protocol. However, in the tradition of Perl, there are several packages available that can work with e-mail. One such package is Mail::Box, which I’ll cover after the discussion of Net::POP3.

The Net::POP3 module is available with many Linux distributions and also from your favorite CPAN mirror. As with other modules, a use statement is the best way to import the Net::POP3 namespace into your Perl program:

use Net::POP3;


Creating a POP3 Object

Like the browser object you encounter when working with the LWP modules (as described in the previous chapter), the Net::POP3 module works by creating a POP3 object. You create this object with a call to the new() method. The new() method returns a reference to the newly created object, which you’re likely to store inside a scalar variable. Here’s an example:

use Net::POP3;

$pop3conn = Net::POP3->new('mail.example.com');

The host, as provided in the example as mail.example.com, isn’t required when you call the new() method. If the host is not set when you call the new() method, it must be configured in Net::Config within the POP3_Hosts parameter. However, you’ll almost always define it in the program, as shown in the example.

Naturally, you can store the host inside its own variable. It’s common to do so by storing the host variable in the beginning of the program or getting it from an external source. For example, you might store the host in a scalar variable called $pophost. You then invoke the call to new() like this:

$pop3conn = Net::POP3->new($pophost);

Sometimes, the mail server is stored in an environment variable.¹ It might be called MAIL_SERVER or POP3_SERVER. The name of the environment variable depends on your system; there is no set standard. Use the shell command printenv or export to see your environment variables. Alternatively, you can iterate through the environment variables from within your Perl program with the following code (introduced in Chapter 4):

foreach $key (keys %ENV) {
    print "Environment key $key is $ENV{$key}\n";
}

Here’s an example that sets the POP3 host for the call to the new() method based on the environment variable, assuming an environment variable of POP3_SERVER:

$pop3conn = Net::POP3->new($ENV{POP3_SERVER});

The host can also be a list of POP3 hosts, which the Net::POP3 documentation says to pass as a reference to an array. If a list of hosts is given, the program will try each in turn. This is not a common scenario. Usually, the mail spool is stored on one server, and if there are multiple servers, the correct one is chosen automatically.

For the rest of this discussion, I’ll use the variable $pop3conn to refer to the Net::POP3 connection object created here.
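The pieces of this section can be combined into a short sketch. The two helper subroutines, pop3_host() and first_pop3_connection(), are hypothetical names, and POP3_SERVER is the assumed environment variable from the text. The second helper tries each host in turn with an explicit loop and relies only on new() returning undef when a connection fails.

```perl
use strict;
use warnings;
use Net::POP3;

# Pick the host from the environment, falling back to a default.
# POP3_SERVER is the assumed variable name used in the text.
sub pop3_host {
    my ($default) = @_;
    return $ENV{POP3_SERVER} || $default;
}

# Try each host in turn and return the first connection that succeeds.
# new() returns undef when a connection cannot be made.
sub first_pop3_connection {
    my (@hosts) = @_;
    for my $host (@hosts) {
        my $conn = Net::POP3->new($host, Timeout => 30);
        return $conn if $conn;
        warn "Could not connect to $host\n";
    }
    return;
}

# Example call (commented out so the sketch makes no network connection):
# my $pop3conn = first_pop3_connection(pop3_host('mail.example.com'))
#     or die "No POP3 server available\n";
```

Checking the return value of new() matters because every later method call would otherwise fail on an undefined value.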

Setting and Getting Other POP3 Connection Parameters

Four other parameters are available when you’re setting up a connection with Net::POP3. You can set any of the parameters when you create the connection object or later.

1. If an environment variable isn’t set, you could set one. However, I don’t see a particular advantage to doing so as opposed to just defining it within your program.


When you set options at the time of connection object creation, they are set as name => value pairs. Option names are case-sensitive and capitalized. For example, to set the Timeout parameter to 30 seconds on creation of the connection object, do this:

$pop3conn = Net::POP3->new("mail.example.com", Timeout => 30);

To set more than one parameter, separate them with a comma:

$pop3conn = Net::POP3->new("mail.example.com", Timeout => 30, Debug => 1);

Let’s look at the Net::POP3 connection parameters Host, Timeout, ResvPort, and Debug. The following examples show setting options after the connection object has been created.

Host

The host must be set at creation of the POP3 object. You can find out the name of the current host for a given POP3 connection object by calling the host() method with no arguments:

$pop3conn->host();

Recall the example earlier in this section that created a POP3 connection to mail.example.com. Now consider this example that prints the name of the current host:

use Net::POP3;
$pop3conn = Net::POP3->new('mail.example.com');
print "The POP3 Server is " . $pop3conn->host() . "\n";

Timeout

The timeout value is the amount of time to wait for a response from the POP3 server. The default is 120 seconds. Like other parameters, it can be set at creation or set later by calling the option directly. This example sets the value to 30 seconds:

$pop3conn->timeout(30);

ResvPort

Don’t be confused by the ResvPort option. This option is used to set the local port from which connections will originate. It is not used to set the port of the server. ResvPort can be useful if you have a firewall that allows only certain ports as source ports, for example (though that would be quite an uncommon configuration).

Debug

The debug option can be a lifesaver when you’re having trouble getting the POP3 connection to work. When you set debug to 1, additional output is printed to STDERR, including the actual POP3 conversation between the program and the server. Like other options, debug can be set at the time of object creation or later, within the program. The option is either disabled (default or 0) or enabled by setting the value to 1:

$pop3conn->debug(1);
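One way to keep connection settings in a single place is sketched below. The helper names pop3_options() and open_pop3() are hypothetical, and the default values are assumptions chosen for illustration. The option names passed to new() use the capitalized spelling that Net::POP3 expects.

```perl
use strict;
use warnings;
use Net::POP3;

# Merge caller-supplied options over some defaults for Net::POP3->new().
# The default values chosen here are assumptions for illustration.
sub pop3_options {
    my (%overrides) = @_;
    my %options = (Timeout => 30, Debug => 0, %overrides);
    return %options;
}

# Open a connection with those options, or die with a readable message.
sub open_pop3 {
    my ($host, %overrides) = @_;
    my $conn = Net::POP3->new($host, pop3_options(%overrides))
        or die "Could not connect to $host\n";
    return $conn;
}

# Example call (commented out so the sketch makes no network connection):
# my $pop3conn = open_pop3('mail.example.com', Debug => 1);
# $pop3conn->timeout(60);    # options can still be changed afterward
```

Turning Debug on only through an override keeps the POP3 conversation out of normal runs while leaving it one argument away when something goes wrong.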