Beginning Regular Expressions 2005.pdf

Chapter 9

But if you don’t really understand the data you are working with, even a regular expression with correct syntax can turn up unexpected results, by lowering either sensitivity or specificity.

Abbreviations

Abbreviations can significantly lower the sensitivity of a regular expression. For example, the forms Dr (with no period character) and Dr. (with a period character) are frequently used as abbreviations for the title Doctor. In some circumstances, you may be confident that only one form is used in the data source. If all three forms occur in the data, a pattern like the following will be necessary to avoid missing some desired matches:

(Doctor|Dr\.|Dr)
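The alternation can be checked with a short script. This is a minimal sketch using Python's re module; the choice of Python and the sample strings are assumptions made for illustration. The period is written as \. so that it matches only a literal period, and the last lines show what happens when the escape is left out.

```python
import re

# Escaped period: the alternatives are "Doctor", "Dr." and "Dr".
title = re.compile(r"(Doctor|Dr\.|Dr)")

# Hypothetical sample strings, invented for this sketch.
samples = ["Doctor Smith", "Dr. Smith", "Dr Smith"]
matched = [title.match(s).group(0) for s in samples]
print(matched)  # ['Doctor', 'Dr.', 'Dr']

# With an unescaped period, "Dr." means "Dr" plus any one character,
# so it also swallows the "u" in "Drum".
loose = re.compile(r"(Doctor|Dr.|Dr)")
print(loose.match("Drum").group(0))  # Dru
```

Listing Dr\. before Dr matters here: alternatives are tried from left to right, so the longer form has to come first if the period is to be included in the match.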

Similar issues arise when handling data that includes information about qualifications. For example, if a Doctor of Philosophy degree is of interest, it will often be written as PhD (no space character or period character), Ph.D. (two period characters), or Ph. D. (one space character, two period characters).

To match the options just mentioned, a pattern such as the following would be satisfactory:

Ph\.? ?D\.?

It uses \. (an escaped period character) twice, each time followed by a ? quantifier, to match the optional period characters, and it allows an optional space character before the D. Depending on where the degree was obtained, the form D.Phil. (two period characters) or DPhil (no period characters) can also occur. To allow for these additional forms, a pattern such as the following would be needed:

(Ph\.? ?D\.?|D\.?Phil\.?)
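As a rough check, the five written forms mentioned above can be run through the pattern. The sketch below is in Python (an assumption of this example). It writes the first alternative with an optional period and an optional space after Ph, which is what the description of the three Ph forms implies.

```python
import re

# Optional period after "Ph", optional space, optional period after "D",
# plus the D.Phil. / DPhil forms as a second alternative.
degree = re.compile(r"(Ph\.? ?D\.?|D\.?Phil\.?)")

variants = ["PhD", "Ph.D.", "Ph. D.", "D.Phil.", "DPhil"]
results = {v: bool(degree.fullmatch(v)) for v in variants}
print(results)
```

If the optional space or the first ? quantifier is dropped, one of the five forms stops matching, which is the kind of silent loss of sensitivity this section describes.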

Characters from Other Languages

The focus of this book is the use of regular expressions with English, including U.S. English and British English. However, with the increasing globalization of trade, the inclusion of words and characters from other languages commonly occurs in documents that are, for the most part, written in English.

In Canada, many official documents are in French. Therefore, many characters with accents will be routinely encountered.

In documents written in English, there can be differences in how words are written. For example, the test text

“Nostalgia is not what it used to be.” That is my favorite cliche.

might equally have been written as follows:

“Nostalgia is not what it used to be.” That is my favorite cliché.

Sensitivity and Specificity of Regular Expressions

The second version includes the character é (e with an acute accent) just before the period that concludes the sentence. To match both forms, you would need to use a pattern such as the following:

clich(e|é)
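The pattern can be tried against both versions of the sentence. The Python sketch below also illustrates one caveat that goes beyond the text: é can be stored either as a single character or as e followed by a combining acute accent (U+0301), and the pattern only sees the accent in the first case. Normalizing the input with unicodedata.normalize is one possible way to handle this, offered here as an assumption rather than as the book's method.

```python
import re
import unicodedata

# \u00e9 is the single-character (precomposed) form of e-acute.
cliche = re.compile("clich(e|\u00e9)")

plain = "That is my favorite cliche."
accented = "That is my favorite clich\u00e9."
print(bool(cliche.search(plain)), bool(cliche.search(accented)))  # True True

# Decomposed form: "e" followed by a combining acute accent.
decomposed = unicodedata.normalize("NFD", accented)
print(len(cliche.search(decomposed).group(0)))  # 6: the accent is left out

# Normalizing to the composed form (NFC) restores the full match.
composed = unicodedata.normalize("NFC", decomposed)
print(len(cliche.search(composed).group(0)))    # 6, now ending in the accented letter
```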

Foreign characters introduce other issues when they occur in HTML. The sample document, EAcute.html, uses the notation &eacute; instead of the literal character:

<h2>“Nostalgia is not what it used to be.” That is my favorite clich&eacute;.</h2>

Yet as you can see in Figure 9-8, the correct character is displayed on the Web page.

Figure 9-8

Foreign characters may also be expressed as numeric character references based on their Unicode code points (for example, &#233; for é), which is another consideration to keep in mind when attempting to match those characters.
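One way to cope with both notations is to decode the character references before matching. The sketch below assumes Python and its standard html.unescape function; the two markup strings are shortened, hypothetical versions of the heading in EAcute.html.

```python
import html
import re

cliche = re.compile("clich(e|\u00e9)")

# Hypothetical fragments: one named reference, one numeric reference.
named = "<h2>That is my favorite clich&eacute;.</h2>"
numeric = "<h2>That is my favorite clich&#233;.</h2>"

# Against the raw markup the pattern fails: "clich" is followed by "&".
print(bool(cliche.search(named)))                   # False

# After decoding, both notations become the literal character.
print(bool(cliche.search(html.unescape(named))))    # True
print(bool(cliche.search(html.unescape(numeric))))  # True
```

The alternative is to add the references to the pattern itself as further options, which keeps the source untouched but makes the pattern longer for every accented character involved.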

Names

As people become more mobile (in employment, for example), names originating in languages that don’t use Roman script are likely to appear more often in human resources data and similar sources. For example, the Indian first name that is sometimes spelled Saurav is also spelled Saurabh and, less commonly, Surav. The pattern Saurav would fail to match the latter two spellings, although you would quite possibly want to match all occurrences of the name. To match all spellings, you would need a pattern such as the following:

Sa?ura(v|bh)
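A short Python check (Python being an assumption of this sketch) confirms that the pattern covers the three spellings, and also shows that it accepts a fourth combination, Surabh, that was not in the list. Whether that extra match is welcome depends on the data.

```python
import re

name = re.compile(r"Sa?ura(v|bh)")

listed = ["Saurav", "Saurabh", "Surav"]
print([bool(name.fullmatch(s)) for s in listed])  # [True, True, True]

# The optional "a" and the alternation combine freely,
# so this unlisted spelling matches as well.
print(bool(name.fullmatch("Surabh")))             # True

# The literal pattern misses the variant spellings.
print(bool(re.fullmatch(r"Saurav", "Saurabh")))   # False
```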

Similar considerations apply to other foreign names. The Russian name for Peter, sometimes transliterated as Pyotr, may also be found spelled as Petr or Pëtr, or even translated as Peter, and may need to be matched in all instances. To match all these possible forms of the name, you might use a pattern like this:

P(yo|e|ë)tr|Peter

Some European surnames have variant spellings too. For example, the surname Van Nistelrooy (with an intermediate space character) can also be spelled Van Nistelrooij or VanNistelrooy (with no intermediate space character). So a pattern such as the following would be needed to match these three spelling variants:

Van *Nistelroo(ij|y)

Of course, because some such surnames may sometimes be spelled with a lowercase v in van, the following pattern might be more sensitive in some situations:

[vV]an *Nistelroo(ij|y)
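The following Python sketch (the language is an assumption of this example) runs the pattern against the variants discussed. It also shows two side effects of the space followed by the * quantifier: any number of spaces is accepted, but other whitespace, such as a tab, is not.

```python
import re

surname = re.compile(r"[vV]an *Nistelroo(ij|y)")

variants = ["Van Nistelrooy", "Van Nistelrooij",
            "VanNistelrooy", "van Nistelrooy"]
matched = [v for v in variants if surname.fullmatch(v)]
print(matched == variants)                          # True

# " *" means zero or more space characters, so two spaces also match...
print(bool(surname.fullmatch("Van  Nistelrooy")))   # True

# ...but a tab between the two parts does not.
print(bool(surname.fullmatch("Van\tNistelrooy")))   # False
```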

Sensitivity and How to Achieve It

To achieve maximum sensitivity, you must be aware of all the variant character sequences that can be used to express the character sequence that you want to match.

Each time you add some component to a pattern that makes it more specific, you need to carefully consider whether, given the data you are working with, it might also cause some desired matches to fail.

Specificity and How to Maximize It

Conceptually, the way to maximize specificity is to make the regular expression as specific as possible. There are many techniques to cut out unwanted matches, several of which have been discussed earlier in this chapter.

When attempting to maximize specificity, it is important to give careful consideration to the character sequences that you don’t want to match and to construct a pattern that excludes them. Achieving high specificity involves understanding regular expression syntax and the effects of the techniques available to you, and understanding how those techniques behave on the data you are working with.

Revisiting the Star Training Company Example

In Chapter 1, you looked at an example that posed a challenge to a new recruit to the fictional Star Training Company. Having learned a range of techniques in Chapters 2 through 7, you are now in a much better position to avoid many of the pitfalls that occurred when a simple find and replace was attempted in Chapter 1.

For convenience, the sample text, StarOriginal.txt, is reproduced here:

Star Training Company

Starting from May 1st Star Training Company is offering a startling special offer to our regular customers - a 20% discount when 4 or more staff attend a single Star Training Company course.

In addition, each quarter our star customer will receive a voucher for a free holiday away from the pressures of the office. Staring at a computer screen all day might be replaced by starfish and swimming in the Seychelles.

Once this offer has started and you hear about other Star Training customers enjoying their free holiday you might feel left out. Don’t be left on the outside staring in. Start right now building your points to allow you to start out on your very own Star Training holiday.

Reach for the star. Training is valuable in its own right but the possibility of a free holiday adds a startling new dimension to the benefits of Star Training training.

Don’t stare at that computer screen any longer. Start now with Star. Training is crucial to your company’s wellbeing. Think Star.

The problem definition can be expressed as follows:

Match all occurrences of the character sequence S, t, a, and r when that character sequence refers to the Star Training Company. Replace each occurrence of the preceding character sequence with the character sequence M, o, o, and n.

The objective is to replace all references to the fictional Star Training Company with corresponding references to the equally fictional Moon Training Company.

When faced with a task like this in real life, it can be helpful to view a few sample documents in a text editor or word processor with search facilities. That allows you to enter a pattern to look for occurrences of character sequences that might be relevant. In this case, you can use the simple literal pattern star (all lowercase) and use regular expressions matching in a case-insensitive way.

Try It Out: Replacing Star with Moon

1. Open the file StarOriginal.txt in OpenOffice.org Writer.

2. Open the Find & Replace dialog box using Ctrl+F.

3. Check the Regular Expressions check box, but leave the Match Case check box unchecked, because you want to find all occurrences of the specified pattern in a case-insensitive way.

4. Type the pattern star in the Search For text box, and click the Find All button.

5. Inspect the matches shown in Figure 9-9, paying careful attention to any occurrences of the character sequence star that refer to the Star Training Company.

Figure 9-9

How It Works

The matching in the preceding example is straightforward and matches all occurrences of the literal character sequence that constituted the pattern.
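The same exploratory search can be scripted. The sketch below assumes Python rather than OpenOffice.org Writer, and uses only the first sentence of the sample text. It lists every word that contains star in any letter case, which gives a quick feel for how many of the hits are unrelated to the company name.

```python
import re

# First sentence of the sample text, used here as a short excerpt.
excerpt = ("Starting from May 1st Star Training Company is offering "
           "a startling special offer to our regular customers.")

# \w*star\w* expands each hit to the whole word that contains it.
words = [m.group(0)
         for m in re.finditer(r"\w*star\w*", excerpt, re.IGNORECASE)]
print(words)  # ['Starting', 'Star', 'startling']
```

Only one of the three hits refers to the Star Training Company, so the literal pattern on its own has poor specificity for the replacement task.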

OpenOffice.org Writer is convenient for doing this on a single document. However, when dealing with multiple documents, a tool such as PowerGrep allows you to look for matches in several documents at the same time, highlighting each match for your convenience. This can save a lot of time when getting a feel for the data that you have to manipulate.

Let’s take time to list the character sequences that you want to match. You want to match star in the following:

Star Training

Star.

You want to avoid matching star in the following character sequences:

Starting
startling
star customer
Staring
starfish
started
Start right
start out
star.
startling
stare
Start now

I don’t routinely take time to lay out desired matches in a list and undesired matches in a second list. But particularly when you need to get things as close to 100 percent sensitivity and 100 percent specificity as possible, it makes a lot of sense to make lists like this.

Splitting character sequences into desired matches and undesired matches can be really helpful in working out how sensitive and specific any pattern will prove to be.

If you decide that a lookahead is the way to proceed (as it probably is), you could try to match all desired matches using the following pattern:

Star(?= Training)

However, if you look at the list of desired matches, you can see immediately that the preceding pattern will fail on a sentence such as “Think Star.”, which is one of the occurrences of Star followed by a period character.

The following pattern, which offers alternation of two lookaheads, fits all the desired matches that you have seen in the sample text:

Star((?= Training)|(?=\.))

Thus, as judged by the sample text, you have 100 percent sensitivity. Figure 9-10 shows the preceding pattern being tested against the character sequence Star.
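The two lists can be turned into a small scoring script. In this Python sketch (the language, the score function, and its name are assumptions of the example), sensitivity is taken as the fraction of desired sequences that the pattern matches and specificity as the fraction of undesired sequences that it rejects. That is one simple way to put numbers on the two ideas for a fixed test set, with the duplicate entry in the undesired list dropped.

```python
import re

# Desired and undesired character sequences from the lists above.
desired = ["Star Training", "Star."]
undesired = ["Starting", "startling", "star customer", "Staring",
             "starfish", "started", "Start right", "start out",
             "star.", "stare", "Start now"]

def score(pattern, flags=0):
    """Return (sensitivity, specificity) of a pattern over the two lists."""
    regex = re.compile(pattern, flags)
    found = sum(1 for s in desired if regex.search(s))
    rejected = sum(1 for s in undesired if not regex.search(s))
    return found / len(desired), rejected / len(undesired)

print(score(r"star", re.IGNORECASE))         # (1.0, 0.0)
print(score(r"Star(?= Training)"))           # (0.5, 1.0)
print(score(r"Star((?= Training)|(?=\.))"))  # (1.0, 1.0)
```

The scores describe only these lists. A pattern that reaches 1.0 on both can still miss or over-match sequences that the sample text happens not to contain.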

It is always wise to assume that the test data you have looked at may not contain all the likely or possible character sequences that you need to think about. One of the exercises in this chapter asks you to modify the preceding pattern to allow for other possible occurrences that might be relevant to the uses of Star that are of interest.

The patterns that you want to match are, in general, different from those that you want not to match. So it is generally straightforward to be sure that the pattern does not match any of the undesired character sequences, with one exception: You want to match the five-character sequence of characters Star. (with an initial uppercase S) but not match the five-character sequence of characters star. (with an initial lowercase s).
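The distinction between Star. and star. comes down to matching case-sensitively. The Python sketch below (an assumption of this example, using sentences adapted from the sample text) performs the replacement with the two-lookahead pattern and leaves the lowercase star. alone.

```python
import re

pattern = re.compile(r"Star((?= Training)|(?=\.))")

# Sentences adapted from the sample text.
text = ("Reach for the star. Training is valuable in its own right. "
        "Start now with Star. Training is crucial. Think Star.")

# Case-sensitive matching: lowercase "star." and "Start" are untouched.
result = pattern.sub("Moon", text)
print(result)
# Reach for the star. Training is valuable in its own right. Start now with Moon. Training is crucial. Think Moon.

# Adding re.IGNORECASE would wrongly replace "star." as well.
careless = re.sub(pattern.pattern, "Moon", text, flags=re.IGNORECASE)
print(careless.startswith("Reach for the Moon."))  # True
```

Because the lookaheads are zero-width, only the four characters of Star are replaced, and the following space or period stays in place.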
