Добавил:
Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:
Beginning Regular Expressions 2005.pdf
Скачиваний:
95
Добавлен:
17.08.2013
Размер:
25.42 Mб
Скачать

Chapter 9

Figure 9-7

The Sensitivity/Specificity Trade-Off

Sensitivity and specificity are always part of a trade-off. Sensitivity and specificity are components of the trade-off, but the amount of effort required to get 100 percent sensitivity and 100 percent specificity may not be practical in some situations. Some undefined “good” specificity may be enough. It’s a trade-off in that, in the end, only you can judge how much effort is appropriate for the task that you are using regular expressions to achieve.

How important are sensitivity and specificity? The answer is, “It depends.” There are many times when you will need high sensitivity, 100 percent sensitivity ideally, and at the same time you also need high specificity. At other times, one or the other may be less important. This section looks at some of the factors that influence how much importance it is relevant to place on sensitivity and specificity.

It depends to a significant extent on who the customer is. If you are using regular expressions to achieve something for your own use, you may not worry too much if you miss one or two matches. On the other hand, if you are conducting a replacement of every occurrence of a company name after a takeover, for example, it would be serious if sensitivity fell below 100 percent.

How Metacharacters Affect Sensitivity and

Specificity

In general, the more metacharacters you use, the more specific a pattern becomes. The pattern cat matches that sequence of characters whether they refer to a feline mammal or form character sequences in words such as cathode and caterpillar.

230

Sensitivity and Specificity of Regular Expressions

Adding further metacharacters, such as the \b word boundary, makes the use of the character sequence cat in a pattern much more specific. The pattern \bcat\b will match only the word cat (singular).

When using specific patterns like that, you need to watch carefully for the possibility of reducing sensitivity. The pattern \bcat\b will match cat but won’t match cats, for example. If you are interested in finding all references in the document to feline mammals, the \bcat\b pattern may not be the best option. You may want to allow for the occurrence of the plural form, cats, and the possessive form, cat’s, too. The pattern \bcat’?s?’?\b would match cat, cats, cats’ (plural possessive), and cat’s (singular possessive) but would also match cat’, which is unlikely to be a desired match. If your data is unlikely to contain the character sequence cat’, the pattern \bcat’?s?’?\b may be sufficient. But if, for some reason, you want to match only cat, cats, cat’s, and cats’, some other, more specific pattern will be needed. One simple option is as follows:

(cat|cats|cat’s|cats’)

An alternative follows:

ca(t|ts|t’s|ts)

Similar issues apply whatever the word or sequence of characters of interest.

Sensitivity, Specificity, and Positional Characters

The positional characters explored in Chapter 6 can be expected in many cases to affect both sensitivity and specificity.

In the following example, an initial version of the problem definition can be expressed as follows:

Match all occurrences of the sequence of characters t, h, and e case insensitively.

The pattern the will match twice in the following text:

Paris in the the spring.

It will match once in the following text:

The spring has sprung.

However, suppose you modify the problem definition to the following:

Match the position at the beginning of a string; then match the sequence of characters t, h, and e case insensitively.

The pattern ^the now has no match in the first sample text but still has a single match in the second. The effect of adding one or more positional metacharacters depends on the data the pattern is being matched against.

231

Chapter 9

Sensitivity, Specificity, and Modes

When you specify that a regular expression is to be executed in a case-insensitive or case-sensitive mode, you affect the matches that will be returned. Continuing with the preceding example, the pattern ^the applied case sensitively has no match in either of the two test pieces of text. In the second sample text, the ^ metacharacter matches the position at the beginning of the string, but the lowercase t of the pattern does not match the uppercase T of the test text.

Similarly, the use of the period metacharacter (which matches a large range of characters) can be switched to match or not match a newline character.

Sensitivity, Specificity, and Lookahead and Lookbehind

When you add lookahead or lookbehind to an existing regular expression, you may have no effect on sensitivity ,or you may adversely impact it. Equally, you may improve specificity or, less likely, it may stay the same.

If a lookbehind is carefully crafted, it won’t reduce sensitivity. However, if you make an error in the pattern inside the lookbehind, you will fail to match when you intended to match, reducing sensitivity. Suppose that you wanted to find information about Anne Smith. The following pattern would match when the spelling of Anne is correct, and it is followed by exactly one space character:

(?<=Anne )Smith

However, if Anne is spelled as Ann somewhere in the document you may miss intended matches, because the pattern (?<=Anne )Smith will no longer match.

Equally, if the person’s name were written as A. Smith somewhere in the data, there would be no match. More detailed understanding of the data would be needed to know whether a match was intended or not. The character sequence A. Smith might refer to the person of interest, Anne Smith, but alternatively might refer to Adam Smith or some other person.

Similarly, lookahead can reduce sensitivity. For example, suppose that you want to match all occurrences of the character sequence John. The following pattern would match a word boundary, then the desired character sequence John, and then check if the following character is a space character:

\bJohn(?= )

However, if the test text is as follows, the lookahead is too specific and causes what is likely to be a desired match to fail:

I went with John, and Mary on a trip.

Modifying the lookahead to (?=\b) or (?=\W) would prevent the problem caused by the occurrence of an unanticipated comma.

How Much Should the Regular Expressions Do?

Most of the examples earlier in this book use a range of tools with regular expression functionality to apply regular expressions. That’s great when teaching regular expressions, but when you use regular

232

Sensitivity and Specificity of Regular Expressions

expressions as a developer, you will typically be using regular expressions inside code written in Java, JavaScript, VB.NET, and so on, or you may be applying regular expressions to data retrieved from a relational database. So how much should you expect the regular expressions to do, and how much can you safely assume that your other code or the error checking in a database already does?

For example, suppose you have a collection of HTML documents that include IP addresses, and your task is to amend the style that the IP addresses are displayed in. Suppose that initially, IP addresses are nested inside the start and end tags for HTML b elements, as in the following:

<b>1.12.123.234</b>

What pattern should you use to find such IP addresses? Should you just assume that the data you receive will be correctly formed (including having no values of 256 or more), or should you include a more complex pattern so that the regular expression will match only correctly formed IP addresses?

If you assume that the IP addresses are already correctly formed or are checked by some other part of your code, you could use a fairly simple pattern such as the following:

<b>([0-9]+(\.[0-9]+){3})</b>

This would match character sequences that are not IP addresses, such as the following:

<b>1234.2345.5678.9999999</b>

If you can be sure that your data doesn’t include undesired values such as the preceding one, the simple pattern shown might be enough. Without much work, you can adapt the pattern so that only between one and three numeric digits can be included before or after a period character:

<b>([0-9]{1,3}(\.[0-9]{1,3})</b>

Inappropriate character sequences such as the following would still be matched, but at least you improve the specificity a little by excluding false matches with multiple numeric digits, as shown earlier:

<b>999.256.789.1</b>

If, however, you can’t be sure that the supposed IP addresses are correctly formatted, you may need to develop a longer, more complex pattern. On the other hand, it may not matter for a particular purpose whether the supposed IP addresses are valid IP addresses or not. If that is the situation, the simplest regular expression is likely to be an appropriate option to use.

Knowing the Data, Sensitivity, and

Specificity

One of the key issues that affect how well you achieve sensitivity and specificity is how well you understand the data to which you are applying regular expressions. Of course, your understanding of the regular expression syntax and techniques supported by your chosen language or tool is important, too.

233