Beginning Regular Expressions 2005.pdf

Regular Expression Tools and an Approach to Using Them

Suppose that you use the following regular expression to look for a word such as ball; it will always match.

.*

Because that regular expression, simply stated, says, “Match zero or more characters of any kind” (in most tools, the period matches any character except a newline), it will match any word or sequence of characters. It has 100 percent sensitivity in that it will always match any occurrence of the word ball but essentially 0 percent specificity because it will match a host of undesired words.

Toward the other end of the spectrum, suppose that you want to match all occurrences of all forms of the word ball, including the plural form, balls, and the possessive form, ball’s. In that situation, using the following regular expression pattern will match only the singular form ball if it is in a context where only whole words are matched.

ball

Or it may match the first four letters of balls and ball’s, which may not be what you want.
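The trade-off between sensitivity and specificity can be seen in a small Python sketch. The word list and the helper names (`whole_word_hits`, `substring_hits`) are assumptions made for illustration; `re.fullmatch` stands in for a context where only whole words are matched, and `re.search` for a context where a match may occur anywhere inside a word. The final pattern, `ball(?:s|'s)?`, is one possible way to cover all three forms, not the only one.

```python
import re

# Hypothetical sample data: three wanted forms and three unwanted words.
WORDS = ["ball", "balls", "ball's", "balloon", "football", "bat"]
WANTED = {"ball", "balls", "ball's"}


def whole_word_hits(pattern, words):
    """Words that the pattern matches in their entirety."""
    return [w for w in words if re.fullmatch(pattern, w)]


def substring_hits(pattern, words):
    """Words that contain a match anywhere inside them."""
    return [w for w in words if re.search(pattern, w)]


# .* is fully sensitive but not specific: every word matches.
everything = whole_word_hits(r".*", WORDS)

# ball, matched as a whole word, finds only the singular form.
only_singular = whole_word_hits(r"ball", WORDS)

# ball, matched anywhere, also hits balloon and football.
first_four = substring_hits(r"ball", WORDS)

# An optional plural or possessive ending covers exactly the wanted forms.
all_forms = whole_word_hits(r"ball(?:s|'s)?", WORDS)

print(everything, only_singular, first_four, all_forms, sep="\n")
```

Comparing each result with `WANTED` shows how far a pattern is from the goal: missing entries are a sensitivity problem, and extra entries are a specificity problem.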

To be able to discuss this topic in detail, you will need to understand more about regular expression syntax so that you can try out various options and show the effect of choices that you might make in designing your regular expression patterns.

Create Appropriate Regular Expressions

Once you have given careful thought to precisely what it is you want to do and have studied the data source sufficiently to give you a good understanding of what it contains, you are in a good position to create regular expression patterns appropriate to your needs. There is no magic formula that is appropriate for all situations. Only you, the developer, can decide precisely what you want to match and want not to match. To get the desired results you may need to carry out text manipulation in two steps. However, often you will be able to carry out a match or replace in one step by combining regular expression constructs.

Document All but Simple Regular Expressions

If you are creating regular expression patterns that go beyond simple patterns, I suggest that you seriously consider documenting the regular expression. Why do that? Think about the possible situation in 6 or 12 months’ time when you come back to your code, perhaps because it isn’t behaving exactly as users expect, and you can’t decipher the precise purpose of the regular expression. It’s in these situations that the truism that regular expressions can be difficult to decipher becomes very real. The existence of clear and complete documentation can prevent a lot of wasted time and frustration.

Several of the language-specific chapters later in this book will show you how you can document regular expression code. Some languages (for example, Perl) allow you to specify a mode that enables you to include inline comments about your regular expressions. Documenting each component of a regular expression pattern in that way makes it much easier to follow the intentions of the original developer and either spot any flaws in the approach or analysis, or adapt it more easily to an altered business need.

Chapter 2

When you are using regular expressions interactively (such as in Microsoft Word or OpenOffice.org Writer), it makes little sense to document them. In part that is because the regular expressions (wildcards) you use interactively in a word processor are typically fairly simple, and in part because a word processor doesn’t provide any standard way for you to document the regular expressions you use.

When creating more complicated regular expressions, there are three aspects of the regular expression that I suggest you consider documenting:

What you expect the regular expression to do

What you want to select

What you want not to select

The more complex the problem you are seeking to define and the more complex the regular expression pattern(s) you create, the more likely it is that you will want to take time to document each of these aspects.

Each of these aspects is discussed in the following sections.

Document What You Expect the Regular Expression to Do

Documenting the intention of a regular expression is particularly useful when you are creating a regular expression that is intermediate or higher in complexity. Of course, your perception of what is advanced or complex will change as your experience and skills in writing and interpreting regular expressions increase. You may well find, like many other users of regular expressions, that your intuitive feel for a regular expression and what you intended it to do falls off severely after a few weeks or months. To minimize the effects of that falloff in understanding, it is better to err on the side of too much documentation rather than too little.

How might you document a regular expression? Let’s return to the Star Training Company example. Depending on how you iterate through refinement of the problem definition, you may also find that you need to refine the documentation comments you include in your code.

I will make the assumption that for this project, you are working in Visual Basic .NET. A first attempt at documentation might look like this:

'Replace Star with Moon

At first sight, this seems straightforward but, as you saw in Chapter 1, it is documenting an approach that can result in a messed-up document when it is applied without due attention to refining that initial thought.

Another attempt at defining what ought to be done might be the following:

'Replace Star with Moon when it occurs as a whole word but leave Star
'unchanged when it occurs as part of a word.
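The difference between the two documentation comments can be made concrete with a short sketch. It is written in Python rather than Visual Basic .NET purely for illustration, and the sample sentence and the `\bStar\b` pattern are assumptions, not the book's own solution.

```python
import re

# Hypothetical sample text containing Star as a word and as part of a word.
text = "Startling news: Star Training Company has grown. Star's courses are popular."

# First attempt: replace Star with Moon wherever the characters occur.
naive = re.sub(r"Star", "Moon", text)

# Second attempt: replace Star only when it occurs as a whole word.
# \b is a word boundary, so the Star inside Startling is left unchanged.
refined = re.sub(r"\bStar\b", "Moon", text)

print(naive)
print(refined)
```

Note that the apostrophe in Star's counts as a word boundary, so the possessive form is changed by the refined pattern as well. Whether that is desired is exactly the kind of decision the documentation comments should record.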

If you iterate through the problem definition several times and choose to create the documentation comments early in that iteration process, be sure to update the documentation comments when you make any changes to the regular expression pattern. If you forget to update the documentation comments, you can end up with documentation comments describing something that you no longer intend to do. And that, in my experience, is one of the few situations where having documentation can be worse than having no documentation at all.

In more formal or more complex situations, you may also want to create paper documentation, as part of the documentation of a project, that describes in detail what the regular expressions were intended to do.

Document What You Want to Match

Express as precisely as you can what patterns of characters you want to match. The more formally you make it your habit to express this notion, the more likely you are to fully understand what it is that you really want to do.

Because the effect of a regular expression is to match some sequence of characters, it can be helpful to spell out precisely what sequence(s) of characters it is that you intend to match.

So continuing with the fictional Star Training Company example, you might add these comments to the code:

'Match Star each time it occurs as part of Star Training Company
'Match Star when it is standalone but refers to Star Training Company
'for example in phrases like "Star is the best"
'Match any occurrence of the possessive form Star's

After you document clearly what your aim is, you are in a better position to create a regular expression pattern that does exactly what you intended it to do.

Document What You Don’t Want to Select

This may seem the oddest part of the suggested process because, by definition, the text that you don’t want to match is probably, at first sight anyway, of least interest to you. However, making mistakes so that you match and change undesired text can give you “moontling” results, as you saw with the Star Training Company example, where the word startling was inappropriately replaced by the sequence of characters Moontling due to a search and replace that wasn’t sufficiently specific.

If the new recruit for Star Training Company had taken time to document words that he didn’t want to change (such as the following), the result of the search and replace might have been less obviously bad:

'Don't match any occurrence of words like start, startling
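The lists of text to match and text not to match can also be kept as data next to the pattern and checked mechanically. The following Python sketch assumes a hypothetical pattern, `\bStar(?:'s)?\b`, and invented sample phrases; the `check` helper is likewise an illustration rather than part of any library.

```python
import re

# Hypothetical pattern: Star as a whole word, optionally possessive.
STAR = re.compile(r"\bStar(?:'s)?\b")

# What you want to match.
SHOULD_MATCH = ["Star Training Company", "Star is the best", "Star's courses"]

# What you want not to match.
SHOULD_NOT_MATCH = ["start", "startling", "Startling results", "Starter pack"]


def check(pattern, should_match, should_not_match):
    """Return (missed, unwanted): wanted phrases not found, unwanted phrases found."""
    missed = [s for s in should_match if not pattern.search(s)]
    unwanted = [s for s in should_not_match if pattern.search(s)]
    return missed, unwanted


missed, unwanted = check(STAR, SHOULD_MATCH, SHOULD_NOT_MATCH)

# For comparison, the insufficiently specific pattern from the first attempt.
naive_missed, naive_unwanted = check(re.compile(r"Star"), SHOULD_MATCH, SHOULD_NOT_MATCH)

print(missed, unwanted)
print(naive_missed, naive_unwanted)
```

Two empty lists for a pattern mean it behaves as documented on these samples; they say nothing about text that the lists do not cover.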

The better you understand your data source and the effects of regular expression patterns that you are considering creating, the more specific you can be in your comments.

Use Whitespace to Aid in Clear Documentation of the Regular Expression

In several languages, such as Perl, you can spread a regular expression over several lines. That allows you to use whitespace intelligently together with comments for each logical component of the regular expression, so that you achieve a much clearer set of comments. Because each part of the regular expression has its own comments, ambiguity is reduced or avoided.
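Python offers a comparable facility through its `re.VERBOSE` flag, which ignores whitespace in the pattern and treats `#` as the start of a comment. The sketch below uses it to illustrate the idea; the Star pattern is an assumption chosen for this example and is not taken from the book's Perl chapters.

```python
import re

# The same pattern written twice: once compactly, once spread over
# several lines with a comment for each logical component.
STAR_COMPACT = re.compile(r"\bStar(?:'s)?\b")

STAR_VERBOSE = re.compile(
    r"""
    \b        # word boundary: do not start in the middle of a word
    Star      # the literal company name
    (?:'s)?   # optionally, the possessive ending
    \b        # word boundary: do not match Startling or Starter
    """,
    re.VERBOSE,
)

sample = "Star's Startling offer from Star"
print(STAR_VERBOSE.findall(sample))
```

Because the comments live inside the pattern that is actually compiled, they cannot describe a different expression from the one that runs.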

In some other languages (JavaScript is an example), you cannot use whitespace in this way because a JavaScript regular expression literal must be written on a single line and the language provides no free-spacing mode for patterns. When writing complex regular expressions in a language such as JavaScript, I suggest you consider writing a copy of the regular expression pattern as components, with an explanation of each, on the comment lines immediately following the regular expression pattern itself.

In the short term, it adds to your tasks, but it can save time because you are forced to make explicit what you are trying to do. Down the road, it can help you or another developer modify the code with a fuller understanding of the original objectives.

By adding comments in that fashion, you gain many of the benefits of the detailed documentation when patterns are split over several lines. The key difference is that in languages such as Perl and Java you can add comments inside the regular expression on live components of a pattern, whereas in JavaScript you are adding comments to a text copy of components of the code, which is treated by the JavaScript interpreter as comments.

Of course, one risk of doing that is that the working copy of the regular expression can come to differ from the componentized documentation copy, for example when one is edited and the other is not.
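One way to reduce that risk, where the language allows a pattern to be built from strings, is to assemble the working pattern from the documented components themselves. The sketch below shows the idea in Python; the `COMPONENTS` list, the `describe` helper, and the Star pattern are all assumptions made for illustration. A similar approach should be possible in JavaScript by concatenating strings and passing the result to the `RegExp` constructor.

```python
import re

# Each component is paired with its explanation. The working pattern is
# assembled from these same strings, so the two cannot drift apart.
COMPONENTS = [
    (r"\b", "word boundary: do not start inside a word"),
    (r"Star", "the literal company name"),
    (r"(?:'s)?", "optional possessive ending"),
    (r"\b", "word boundary: do not match Startling"),
]

PATTERN = re.compile("".join(part for part, _ in COMPONENTS))


def describe(components):
    """Return one line of documentation per component."""
    return "\n".join(f"{part:<10} {note}" for part, note in components)


print(PATTERN.pattern)
print(describe(COMPONENTS))
```

The cost of this design is that the pattern is no longer visible in one piece in the source, so printing `PATTERN.pattern` during development can be helpful.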

Test the Results of a Regular Expression

When you are working on a single, fairly simple document, you probably don’t need to test a regular expression other than interactively. Most of the examples in this book, for practical reasons of space, use short, simplified documents. Depending on what you want to use a regular expression to do, you will probably find that you can often carry out regular expression matching interactively when using the examples from this book. That is no more than a simple form of testing. Does the pattern select what you want and avoid selecting undesired matches? Then use it there and then.

However, that interactive approach doesn’t scale. When you are using regular expressions on dozens, hundreds, or perhaps hundreds of thousands of documents, or on documents that may be many megabytes in size, you want to be sure that you don’t create a mess like the one shown in the Star Training Company example in Chapter 1, but on a much larger scale.

Therefore, it makes a lot of sense to carefully test a complex regular expression on some appropriate test data to make sure that you find all the character sequences that you want to find and that you don’t inadvertently change character sequences that you don’t want to be changed.
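A simple way to test a replacement on test data is to preview what it would change before changing anything. The following Python sketch assumes a hypothetical `preview` helper and an invented four-line test document; it reports each line that the replacement would alter, together with the result.

```python
import re


def preview(pattern, replacement, text):
    """Return (line_number, before, after) for each line the replacement would change."""
    changes = []
    for number, line in enumerate(text.splitlines(), start=1):
        changed = re.sub(pattern, replacement, line)
        if changed != line:
            changes.append((number, line, changed))
    return changes


# Hypothetical test document: lines to change mixed with lines to leave alone.
TEST_DOCUMENT = """Star Training Company offers courses.
Startling results are common.
Ask about Star's new schedule.
Start today."""

report = preview(r"\bStar\b", "Moon", TEST_DOCUMENT)
for number, before, after in report:
    print(f"line {number}: {before!r} -> {after!r}")
```

Reading the report against the documented lists of text to match and text not to match shows quickly whether the pattern finds everything it should and nothing it should not.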

The short sample documents in this book may give you ideas about the kind of test documents you will need to create. But simply copying example documents from this book will very likely not work. You must carefully consider the data that you want to select and possibly also similar data that you want to be sure not to select. Only careful thought about the actual data that you are processing and what changes you want to make will allow you to craft a really useful test document.

If you follow the steps suggested here about how to handle nontrivial regular expression tasks, you should be in good shape to create test documents that are relevant and helpful. Because you should also have a good understanding of the desired text manipulation task, you should be able to take your problem definition and translate that into an appropriate and accurate regular expression pattern to achieve what you want. Testing that regular expression on carefully chosen test data will give you confidence that the large-scale manipulation of textual data will also succeed.

When you are intending to manipulate large amounts of data, always make sure that you have good backups of the data. If you plan the text manipulation task well, everything ought to go smoothly, and you are unlikely to need to make use of the backups. But undoing an incorrectly designed regular expression task that has been carried out on many megabytes of data is not a desirable situation to get yourself into.

Having backups is one thing. Having backups that you know can be used to restore data is another. The only way you can be totally sure that your backups work is to test that they can be read and used to restore a configuration. That should be a routine quality assurance procedure for valuable data. It is too late to find out that your backups don’t work at the time when you really need them to rescue you from some disaster, whether it was caused by badly crafted regular expressions or by something else.
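On a very small scale, the habit of backing up, verifying, and restoring can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `.bak` naming, the `backup_and_verify` and `demo` helpers, and the file name `brochure.txt` are all hypothetical, and a real backup procedure for valuable data would involve considerably more than a local copy.

```python
import filecmp
import re
import shutil
import tempfile
from pathlib import Path


def backup_and_verify(source):
    """Copy source to a .bak file and confirm the copy is byte-identical."""
    backup = source.with_name(source.name + ".bak")
    shutil.copy2(source, backup)
    if not filecmp.cmp(source, backup, shallow=False):
        raise RuntimeError(f"backup of {source} does not match the original")
    return backup


def demo():
    """Damage a file with a careless replacement, then restore it from the backup."""
    with tempfile.TemporaryDirectory() as folder:
        document = Path(folder) / "brochure.txt"
        document.write_text("Startling news from Star", encoding="utf-8")

        backup = backup_and_verify(document)

        # A deliberately careless replacement, applied to the file in place.
        original = document.read_text(encoding="utf-8")
        document.write_text(re.sub(r"Star", "Moon", original), encoding="utf-8")
        damaged = document.read_text(encoding="utf-8")

        # Restore from the backup and read the file again.
        shutil.copy2(backup, document)
        restored = document.read_text(encoding="utf-8")
        return damaged, restored


damaged, restored = demo()
print(damaged)
print(restored)
```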

Chapter 3

Simple Regular Expressions

This chapter takes a closer look at some basic aspects of constructing some simple regular expressions. One reason for working through the simple regular expressions examined in this chapter is to reinforce the approach described in Chapter 2 and look at how it can be applied to fairly simple regular expressions.

The examples used are necessarily simple, but by using regular expressions to match fairly simple text patterns, you should become increasingly familiar and comfortable with the use of foundational regular expression constructs that can be used to form part of more complex regular expressions. Later chapters explore additional regular expression constructs and address progressively more complex problems.

One of the issues this chapter explores in some detail is how to match characters that occur other than exactly once: optionally, a specified number of times, or an unbounded number of times.

This chapter looks at the following:

How to match single characters

How to match optional characters

How to match characters that can occur an unbounded number of times, whether the characters of interest are optional or required

How to match characters that can occur a specified number of times

First, let’s look at the simplest situation: matching single characters.