Beginning Regular Expressions 2005.pdf

Chapter 10

The creation of a suitable test case depends on understanding your data source. To test whether or not you detect all references to Star Training, you might include lines with words such as the following:

Star Training

Star.

Star?

And to ensure that you don’t match undesired character sequences, also include text such as the following:

Star performer

Starting from the beginning

Checking that you match the desired text and avoid the undesired text has two possible outcomes. If the pattern behaves as expected on the test data, your confidence in it increases. If desired text fails to match or undesired text does match, you have revealed a problem in your approach.
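The test lines above can be collected into a small harness. The following is a minimal sketch only: the pattern is a hypothetical one chosen for illustration, not a pattern from this chapter, and the name check_pattern is likewise an assumption.

```perl
use strict;
use warnings;

# Hypothetical pattern, used only to illustrate the harness: "Star" followed
# either by " Training" or by sentence-ending punctuation.
my $pattern = qr/\bStar(?: Training\b|(?=[.?!]))/;

my @should_match     = ('Star Training', 'Star.', 'Star?');
my @should_not_match = ('Star performer', 'Starting from the beginning');

# Returns a list of failure descriptions; an empty list means every
# test line behaved as expected.
sub check_pattern {
    my ($re, $wanted, $unwanted) = @_;
    my @failures;
    for my $text (@$wanted) {
        push @failures, "missed desired text: '$text'" unless $text =~ $re;
    }
    for my $text (@$unwanted) {
        push @failures, "matched undesired text: '$text'" if $text =~ $re;
    }
    return @failures;
}

my @failures = check_pattern($pattern, \@should_match, \@should_not_match);
print @failures
    ? join("\n", @failures) . "\n"
    : "All test lines behaved as expected.\n";
```

Keeping the desired and undesired lines in two separate lists makes it easy to add a new line whenever the data source turns up a case you had not considered.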

When your pattern doesn’t behave as expected, first look at the unexpected results to see if you can quickly spot why the behavior differs from the results you expect. In examples in earlier chapters, I stepped through explanations character by character. If you understood those, it should help you interpret what you expect to happen with your own patterns. If that analysis doesn’t work, you may need to go back to the beginning of the process and create a problem definition which you then refine, as well as invest more time in understanding the data source.

Debugging Regular Expressions

The first thing to say about debugging regular expressions is that you should avoid it if at all possible. Debugging regular expressions can be time-consuming and intensely frustrating.

The more time you invest in stepping through the refinement of a problem definition, the clearer your ideas of what you need the regular expression to do should be. If you also clearly document in your code what you expect each component of the regular expression pattern to achieve, you should substantially reduce the number of times you have thoroughly puzzling behavior from your regular expression code.
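One way to document what each component is expected to achieve is to write the pattern in extended mode, with a comment beside each part. The sketch below assumes the simple name pattern ^(Jim|Fred)$ and a hypothetical variable name; it shows one possible layout rather than a required style.

```perl
use strict;
use warnings;

# The x modifier makes Perl ignore the layout whitespace and the comments
# inside the pattern, so each component can be documented on its own line.
my $name_pattern = qr{
    ^          # start of the input: nothing may precede the name
    (          # open a group holding the alternatives
      Jim      #   first accepted name
      |        #   or
      Fred     #   second accepted name
    )          # close the group
    $          # end of the input: nothing may follow the name
}x;

for my $name ('Jim', 'Fred', 'Freddy') {
    printf "%-6s %s\n", $name,
        ($name =~ $name_pattern ? 'matches' : 'does not match');
}
```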

However, even when your code is thoroughly thought out, a few problems can crop up.

Treacherous Whitespace

Whitespace is a treacherous commodity in regular expressions. Even on modern, high-resolution monitors it can be difficult to be sure whether a whitespace character, particularly a single space character, is in the pattern or not. The same uncertainty can arise about whitespace in the relevant parts of the test text.


Documenting and Debugging Regular Expressions

Problems due to whitespace can occur both when expected whitespace is missing from the pattern and when unexpected whitespace characters are present.
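When you suspect missing or unexpected whitespace, it can help to print the pattern and the test text with the whitespace made visible. The helper below is a hypothetical sketch (the name reveal_whitespace and the marker strings are assumptions), not a feature of Perl itself.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the text in which each whitespace
# character is replaced by a visible marker, wrapped in square brackets so
# that leading and trailing whitespace cannot be overlooked.
sub reveal_whitespace {
    my ($text) = @_;
    $text =~ s/ /<SP>/g;
    $text =~ s/\t/<TAB>/g;
    $text =~ s/\r/<CR>/g;
    $text =~ s/\n/<LF>/g;
    return "[$text]";
}

print reveal_whitespace('^(Jim| Fred)$'), "\n";   # a pattern, as a string
print reveal_whitespace("Fred "), "\n";           # a piece of test text
```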

A common error by relatively inexperienced regular expression programmers is including a space character next to the pipe character (|), which separates options in a regular expression. Superficially, it makes the pattern easier to read, but at the cost of changing the meaning of the pattern.

If you are using extended mode, which was described earlier in this chapter, literal whitespace characters in your pattern will be ignored (in Perl, unless they are escaped or appear inside a character class). So if your pattern requires you to match a character sequence that depends on whitespace characters, such as a space character, you must specify the whitespace character(s) using metacharacters such as \s, which matches any whitespace character.
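The effect of extended mode on a literal space can be checked with a short script. This is a minimal sketch using the test text Star Training; the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $text = 'Star Training';

# Under the x modifier the literal space is ignored, so this pattern is
# equivalent to StarTraining and does not match the test text.
my $literal_space = qr/Star Training/x;

# \s matches any single whitespace character, so the space is required.
my $escaped_class = qr/Star \s Training/x;

# A space inside a character class is kept, even in extended mode.
my $bracket_space = qr/Star [ ] Training/x;

for my $case (
    [ 'literal space',      $literal_space ],
    [ '\s metacharacter',   $escaped_class ],
    [ 'space in [ ] class', $bracket_space ],
) {
    my ($label, $re) = @$case;
    printf "%-20s %s\n", $label, ($text =~ $re ? 'match' : 'no match');
}
```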

The file JimOrFred3.pl has an unwanted single space character after the pipe character in the following line:

my $myPattern = "^(Jim| Fred)\$";

Otherwise, JimOrFred3.pl is identical to JimOrFred.pl. The effect of that single whitespace character is that Jim still matches, but Fred does not. The regular expression engine is trying to match a space character followed by the character sequence Fred. The user did not enter that sequence, so matching fails. Figure 10-2 shows the character sequence Fred failing to match.

Figure 10-2

If you don’t have Perl installed on your development machine, visit Chapter 26 and review the download and installation information there if you want to run this code.

Try It Out

Basic Alternation Example

This example asks the user to type his or her first name and then displays a message depending on whether or not the name the user entered was recognized by this very simple system.

The code uses simple alternation (Jim| Fred) to accept the name Jim or the name Fred as user input. The positional metacharacters ^ and $ are also used to specify that no input other than the desired choice of first name will be matched.

1. Type the following Perl code into your favorite text editor, or use the file JimOrFred3.pl in the code download for this chapter.


#!/usr/bin/perl -w
use strict;

print "This program will say 'Hello' to Jim or Fred.\n";
# The space after the pipe character is the deliberate error under discussion.
my $myPattern = "^(Jim| Fred)\$";
print "Enter your first name here: ";
my $myTestString = <STDIN>;
chomp ($myTestString);
if ($myTestString =~ m/$myPattern/)
{
    print "Hello $myTestString. How are you today?";
}
else
{
    print "Sorry I don't know you!";
}

2. Run the code at the command line, using the command perl JimOrFred3.pl.

3. At the prompt, enter Jim, and press the Return key.

4. Inspect the displayed result.

5. Run the code again, and enter Fred; then press the Return key.

6. Inspect the displayed result. (Figure 10-2 shows the appearance after this step.)

How It Works

When the pattern is ^(Jim| Fred)$, the character sequence Jim matches because that character sequence is the one that precedes the pipe character. However, after the pipe character, the required character sequence is a space character followed by Fred. Unless the user types a space character and then Fred, there will be no match.
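The difference can also be seen without typing input at a prompt. The sketch below compares the pattern with the stray space against the pattern without it; it uses qr// and hypothetical variable names instead of the string variable in JimOrFred3.pl.

```perl
use strict;
use warnings;

my $with_space    = qr/^(Jim| Fred)$/;   # stray space after the pipe character
my $without_space = qr/^(Jim|Fred)$/;    # the pattern without the stray space

# Try each input against both patterns and report the outcome side by side.
for my $input ('Jim', 'Fred', ' Fred') {
    printf "%-8s with space: %-3s  without space: %s\n",
        "'$input'",
        ($input =~ $with_space    ? 'yes' : 'no'),
        ($input =~ $without_space ? 'yes' : 'no');
}
```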

You may also be wondering about the \$ in the pattern in JimOrFred3.pl. That issue is discussed in a moment.

Intermittent problems can also occur due to whitespace characters. One possibility is caused by the user, not the developer.

Suppose that you run JimOrFred.pl again, which as you saw in Figure 10-1 matches both the character sequences Jim and Fred. However, Fred may phone you up, telling you that he is locked out of the program when he attempts to log in. What might be happening is that he types Fred Schmidt and then deletes the Schmidt but leaves the space character. That won’t match, because the space character is not allowed. He can send you a screen shot, like that shown in Figure 10-3, which shows the failed login.

Figure 10-3


Admittedly, that example is a little forced. The point that is important to take away is that user actions can be odd. If you don’t code to take those actions into account, you can have an intermittent problem that you never track down, because there is nothing “wrong” with your code, except that it didn’t allow for the user doing something unexpected some of the time.
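One possible way to allow for stray whitespace typed by the user is to trim the input before matching. This is a sketch of that idea only; the helper name normalize_name is an assumption, and whether trimming is appropriate depends on your application.

```perl
use strict;
use warnings;

my $name_pattern = qr/^(Jim|Fred)$/;

# Hypothetical helper: remove leading and trailing whitespace so that
# input such as "Fred " is treated the same as "Fred".
sub normalize_name {
    my ($input) = @_;
    $input =~ s/^\s+|\s+$//g;
    return $input;
}

for my $typed ('Fred', 'Fred ', '  Jim') {
    my $name = normalize_name($typed);
    print "'$typed' -> ",
        ($name =~ $name_pattern ? "Hello $name." : "Sorry I don't know you!"),
        "\n";
}
```

An alternative is to leave the input alone and permit optional whitespace in the pattern itself, at the cost of a slightly longer pattern.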

Backslashes Causing Problems

In some settings, the omission or addition of a backslash character can change the meaning of your regular expression pattern. If the pattern is attempting to match a character sequence that is different from the one you want to match, you will get different matches from those you expected.

In Perl, one such situation occurs when you use the $ metacharacter. Notice that when a value was assigned to the $myPattern variable in JimOrFred3.pl, it was written as follows:

my $myPattern = "^(Jim| Fred)\$";

In this situation, omitting the backslash will mean that your code won’t compile. However, the need to use \$ in this setting to specify the $ metacharacter can be confusing.
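The backslash is needed only because the pattern is stored in a double-quoted string, where Perl treats $ as the start of a variable name. The sketch below shows two alternatives that avoid the backslash, and uses a string eval (an assumption made so that the script keeps running) to observe that the unescaped form is rejected. The variable names are hypothetical.

```perl
use strict;
use warnings;

my $escaped   = "^(Jim|Fred)\$";     # double quotes: the backslash keeps $ literal
my $single    = '^(Jim|Fred)$';      # single quotes: no interpolation, no backslash
my $regex_obj = qr/^(Jim|Fred)$/;    # qr//: the pattern is written as a regex

# Compile the unescaped double-quoted form inside a string eval, so that a
# compilation failure is reported here instead of stopping the whole script.
my $compiled_ok = eval q{ my $p = "^(Jim|Fred)$"; 1 };

print "escaped and single-quoted strings are ",
    ($escaped eq $single ? "identical" : "different"), "\n";
print "unescaped double-quoted form ",
    ($compiled_ok ? "compiled" : "was rejected"), "\n";
```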

In other situations, a lookahead or lookbehind may fail without any compilation error. You may intend the regular expression engine to match a metacharacter, while it is attempting to match a literal character instead. The result is a puzzling failure to match when you are sure that the pattern is correct.
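A related silent failure, chosen here as an illustration and not taken from the examples in this chapter, involves the \b metacharacter. In a double-quoted Perl string, \b is a backspace character, so the regular expression engine never sees the word boundary you intended. The variable names below are hypothetical.

```perl
use strict;
use warnings;

my $text = 'Star Training';

my $broken  = "\bStar\b";     # double quotes turn each \b into a backspace
my $doubled = "\\bStar\\b";   # doubled backslashes reach the regex engine as \b
my $single  = '\bStar\b';     # single quotes leave the backslashes untouched

# All three compile without complaint; only the first fails to match.
for my $case (
    [ 'double quotes, single backslash', $broken  ],
    [ 'double quotes, doubled backslash', $doubled ],
    [ 'single quotes',                    $single  ],
) {
    my ($label, $pattern) = @$case;
    printf "%-34s %s\n", $label, ($text =~ /$pattern/ ? 'match' : 'no match');
}
```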

Considering Other Causes

Complex regular expressions undoubtedly have significant potential for producing unexpected and undesired results. However, the complexity and cryptic nature of a lengthy regular expression pattern should not blind you to the possibility that the cause of undesired results is a flaw somewhere else in your analysis or code.

The range of possible problems depends on what you are doing with your code. Just keep in mind that problems with regular expression code are a complex interaction between the regular expression pattern, the data source, and the surrounding code. Each possibility needs to be examined in a systematic way if the problem persists.
