Beginning Regular Expressions 2005.pdf

Chapter 9

What Are Sensitivity and Specificity?

Sensitivity is the capacity of a pattern to match all the character sequences that you want to match. Specificity is the capacity to limit the character sequences selected by a pattern to those character sequences that you want to detect.

Sensitivity and specificity are terms derived from quantitative disciplines such as statistics and epidemiology. Broadly, sensitivity is a measure of the number of true hits you find divided by the total number of true hits you ought to find if you match all occurrences of the relevant character sequences, and specificity is the number of hits you find that are true hits divided by the total number of hits you find. The higher the sensitivity, the closer you are, in the context of regular expressions, to finding all true matches, and the higher the specificity, the closer you are to finding only true matches.
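The two ratios just described can be written down directly. The following is a minimal sketch in Python, assuming hits are identified by line number; the function names and the sample numbers are illustrative and are not taken from the chapter. Note that the ratio this chapter calls specificity is the one that information retrieval usually calls precision.

```python
def sensitivity(true_hits: set, found: set) -> float:
    """True hits that were found, divided by all true hits that exist."""
    if not true_hits:
        return 0.0  # convention chosen for this sketch: nothing to find
    return len(true_hits & found) / len(true_hits)


def specificity(true_hits: set, found: set) -> float:
    """True hits that were found, divided by all hits found (the chapter's definition)."""
    if not found:
        return 0.0  # convention chosen for this sketch: nothing was selected
    return len(true_hits & found) / len(found)


# Hypothetical search: 4 lines are true hits, but the pattern selects 10 lines.
true_hits = {4, 5, 9, 10}
found = set(range(1, 11))

print(sensitivity(true_hits, found))  # 1.0
print(specificity(true_hits, found))  # 0.4
```

A pattern that selects everything scores perfectly on the first ratio and poorly on the second, which is the situation the next example explores.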

The definitions given may feel a little abstract, so the following examples are provided to develop a clearer understanding of the ideas of sensitivity and specificity.

Extreme Sensitivity, Awful Specificity

Suppose that you want to match the character sequence ABC. It is very easy to achieve 100 percent sensitivity using the following pattern:

.*

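It selects sequences of zero or more characters of any kind (other than a line break), not just alphanumeric characters, so it matches every line whatever that line contains.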

A sample document, ABitOfEverything.txt, is shown here:

ABC123

DEF9FR

Mary had a little lamb.

var x = 234 / 1.56;

<html><body></body></html>

<book></book>

This is a random 58#Gooede garbled piece of 8983ju**nk but it is still selected.

Sensitivity and Specificity of Regular Expressions

As you can see, there is a pretty diverse range of content, not all of which is useful. However, if you apply the regular expression pattern .* you achieve 100 percent sensitivity, because the only occurrence of the character sequence ABC is matched. However, you also select every other piece of text in the sample document, as you can see in Figure 9-1 in OpenOffice.org Writer.

Figure 9-1
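If you don't have OpenOffice.org Writer to hand, the same effect can be reproduced with a short script. This is a sketch using Python's re module rather than the tool shown in the figure; the list simply copies the lines of ABitOfEverything.txt shown earlier, and the variable names are illustrative.

```python
import re

# Lines of ABitOfEverything.txt, copied from the listing in the text.
lines = [
    "ABC123",
    "DEF9FR",
    "Mary had a little lamb.",
    "var x = 234 / 1.56;",
    "<html><body></body></html>",
    "<book></book>",
    "This is a random 58#Gooede garbled piece of 8983ju**nk but it is still selected.",
]

pattern = re.compile(r".*")

# Lines that the pattern selects in full, and lines you actually wanted.
selected = [line for line in lines if pattern.fullmatch(line)]
wanted = [line for line in lines if "ABC" in line]

print(len(selected), "of", len(lines), "lines selected")  # 7 of 7 lines selected
print(len(wanted), "line contains ABC")                   # 1 line contains ABC
```

Every wanted line is selected, so sensitivity is 100 percent, but only one of the seven selected lines is a true hit.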

I introduced this slightly silly example to make an important point. It is possible to create very sensitive regular expression patterns that achieve nothing useful. Of course, you are unlikely to use .* as a standalone pattern, but it is important to carefully consider the usefulness of the regular expression patterns you create when, typically, the issues will be significantly more subtle.

Useful regular expressions keep the 100 percent sensitivity (or something very close to 100 percent) of the .* pattern but combine it with a high level of specificity.

Email Addresses Example

Suppose that you have a large number of documents or a mail file that you need to search for valid email addresses. The file EmailOrNotEmail.txt illustrates the kind of data that might be contained in the material you need to search. The content of EmailOrNotEmail.txt is shown here:

@Home

@ttitude

John@somewhere.invalid

Peter@example.org

Peter@example.info

John@Smith@example.com

20 @ $10 each

@@@ This is a comment @@@

Jane@example.net

Peter.Smith@example.net

You will see pretty quickly that some of the character sequences in EmailOrNotEmail.txt are valid email addresses and some are not.

One approach to matching email addresses would be to use the following regular expression to locate all email addresses:

.*@.*

If you try that pattern using the findstr utility, you can type the following at the command line:

findstr /N /i .*@.* EmailOrNotEmail.txt

You search a single file, EmailOrNotEmail.txt, for the following regular expression pattern:

.*@.*

The /N switch indicates that the line number of any line containing a character sequence that matches the regular expression pattern will be displayed. The /i switch, which isn’t essential here, indicates that the pattern will be applied in a case-insensitive way. Figure 9-2 shows the result of running the specified command.

Figure 9-2

As the figure shows, all the valid email addresses (which are on lines 4, 5, 9, and 10) are selected. This gives you 100 percent sensitivity, at least on this test data set. In other words, you have selected every character sequence that represents a valid email address. But you have, on all the other lines, matched character sequences that are pretty obviously not email addresses. You need to find a more specific pattern to improve the specificity of matching.
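The findstr result can be approximated in Python, which also lets you put numbers on the two measures. This is a sketch, not a reimplementation of findstr: it assumes the ten-line layout of EmailOrNotEmail.txt that the line numbers in the text imply, and the variable names are illustrative.

```python
import re

# EmailOrNotEmail.txt, one list entry per line (ten lines, as the text's line numbers imply).
email_lines = [
    "@Home",
    "@ttitude",
    "John@somewhere.invalid",
    "Peter@example.org",
    "Peter@example.info",
    "John@Smith@example.com",
    "20 @ $10 each",
    "@@@ This is a comment @@@",
    "Jane@example.net",
    "Peter.Smith@example.net",
]
valid = {4, 5, 9, 10}  # line numbers the text identifies as valid addresses

# re.IGNORECASE mirrors the /i switch; like /i, it makes no difference for this pattern.
loose = re.compile(r".*@.*", re.IGNORECASE)

# Collect 1-based line numbers, as findstr /N would report them.
hits = {n for n, line in enumerate(email_lines, start=1) if loose.search(line)}

sens = len(hits & valid) / len(valid)
spec = len(hits & valid) / len(hits)
print(sorted(hits))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(sens, spec)    # 1.0 0.4
```

All four valid addresses are found, but six of the ten selected lines are not valid addresses.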

Look a little more carefully at how an email address is structured. Broadly, an email address follows this structure:

username@somehostname

To achieve a better match, you must find patterns that match the username and the hostname but are more specific than your previous attempt.

The structure of the username can be simply a sequence of alphabetic characters, as here:

AWatt@XMML.com

Or it can include a period character, such as the following:

A.Watt@XMML.com

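Therefore, you need to allow for the possibility of a period character occurring inside the username part of the email address. The following pattern matches, at a minimum, a single alphabetic character due to the \w+ component of the pattern (strictly, \w also matches digits and the underscore character, but this example treats it as matching alphabetic characters):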

\w*\.?\w+

The \w*\.? allows the mandatory alphabetic character(s) to be preceded by zero or more optional alphabetic characters followed by a single optional period character.

You probably don’t want to match an email address that begins with a period character, as in the following:

.Watt@XMML.com

So you could use a lookbehind to allow a match for a period character only when it has been preceded by at least one alphabetic character. This pattern would allow matching of a period character only when it is preceded by an alphabetic character:

\w*(?<=\w)\.?\w+
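To see what the lookbehind changes, you can test both username patterns against the three usernames discussed above. This is a small sketch using Python's re module, whose lookbehind syntax is the same for this pattern; fullmatch is used so that each pattern must account for the whole username.

```python
import re

plain = re.compile(r"\w*\.?\w+")           # no lookbehind
guarded = re.compile(r"\w*(?<=\w)\.?\w+")  # period allowed only after a word character

results = {}
for name in ["AWatt", "A.Watt", ".Watt"]:
    # Record whether each pattern matches the entire username.
    results[name] = (bool(plain.fullmatch(name)), bool(guarded.fullmatch(name)))
    print(name, results[name])

# AWatt (True, True)
# A.Watt (True, True)
# .Watt (True, False)
```

Only the username that begins with a period is treated differently by the two patterns.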

Try It Out

Email Address

1. Open PowerGrep, and enter the pattern \w*(?<=\w)\.?\w+@.* in the Search text area.

2. Enter the folder name C:\BRegExp\Ch09 in the Folder text box. Amend, as appropriate, if you downloaded the sample files to a different directory.

3. Enter the filename EmailOrNotEmail.txt in the File Mask text box, and click the Search button.

4. Inspect the results in the Results area. Compare the matches shown in Figure 9-2 with the matches now shown in Figure 9-3, particularly noting the character sequences that no longer match.

Figure 9-3

This is an improvement. The pattern is more specific. You no longer match the undesired character sequences on lines 1, 2, 7, and 8. However, the character sequence on Line 3, John@somewhere.invalid, is still matched even though it is not a valid email address.

You can remove that undesired match by making the hostname part of the email address more specific. How specific you want to be is a matter of judgment. You know that all hostnames will have a sequence of alphabetic characters, followed by a period character, followed by three (com, net, org, or biz) or four (info) alphabetic characters. For the purposes of this example we won’t consider hostnames like example.co.uk. The following pattern would be an appropriate pattern to match hostnames that correspond to the structure just described:

\w+\.\w{3,4}

The \w+ will match even single character domain names (which are allowed with .com, .net, and .org domains). The \. metacharacter matches a single period character, and the \w{3,4} component matches either three or four alphabetic characters.

Combining that pattern with your earlier one gives you the following:

\w*(?<=\w)\.?\w+@\w+\.\w{3,4}

5. Enter the pattern \w*(?<=\w)\.?\w+@\w+\.\w{3,4} in the Search text area, and click the Search button.

6. Inspect the results. Notice that the full character sequence on Line 3, John@somewhere.invalid, is no longer matched, although \w{3,4} still matches the first four characters of invalid, so the shorter sequence John@somewhere.inva is selected. In addition, a problem on Line 6, not mentioned earlier, is brought to the surface. On Line 6, the seeming email address has two @ characters, which is not allowed.
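It can help to look at the exact text that the combined pattern selects on each line, because an unanchored pattern is free to match only part of a line. The following sketch uses Python's re module rather than PowerGrep and assumes the ten-line layout of EmailOrNotEmail.txt; the variable names are illustrative.

```python
import re

email_lines = [
    "@Home",
    "@ttitude",
    "John@somewhere.invalid",
    "Peter@example.org",
    "Peter@example.info",
    "John@Smith@example.com",
    "20 @ $10 each",
    "@@@ This is a comment @@@",
    "Jane@example.net",
    "Peter.Smith@example.net",
]

combined = re.compile(r"\w*(?<=\w)\.?\w+@\w+\.\w{3,4}")

# Map each 1-based line number to the text matched on that line, if any.
matches = {}
for n, line in enumerate(email_lines, start=1):
    m = combined.search(line)
    if m:
        matches[n] = m.group()
        print(n, m.group())

# 3 John@somewhere.inva
# 4 Peter@example.org
# 5 Peter@example.info
# 6 Smith@example.com
# 9 Jane@example.net
# 10 Peter.Smith@example.net
```

Lines 3 and 6 are still selected, but only in part: the match on Line 3 stops after four characters of the final label, and the match on Line 6 starts after the first @ character.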

One way to approach this is to use a lookahead to specify that following the first match for an @ character, another @ character does not occur. If you continue to assume that only alphabetic characters are allowed in an email address, you can specify that you look ahead from the first @ character matched to the first match for a character that is not an alphabetic character or a period character.

You can do that using the following pattern:

\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}

7. Edit the pattern in the Search text area to be \w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}, and click the Search button.

8. Inspect the results. Figure 9-4 shows the appearance.

Figure 9-4

Unfortunately, the lookahead has not solved the problem with the undesired matches on lines 3 and 6. You need to specify that the pattern is the whole text on a line. In other words, you add a ^ metacharacter to specify the position at the start of the line and the $ metacharacter to specify the position at the end of the line.

9. Modify the pattern in the Search area to be ^\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}$, and click the Search button.

10. Inspect the results. Figure 9-5 shows the appearance.

Figure 9-5
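The effect of adding the ^ and $ metacharacters can be checked by running the lookahead-only pattern and the anchored pattern over the same text. This sketch uses Python's re module with re.MULTILINE so that ^ and $ apply to each line. It assumes the ten-line layout of EmailOrNotEmail.txt and that the text ends with a newline, because the \W in the lookahead needs some character after the address; both are assumptions of the sketch, not statements about PowerGrep.

```python
import re

email_lines = [
    "@Home",
    "@ttitude",
    "John@somewhere.invalid",
    "Peter@example.org",
    "Peter@example.info",
    "John@Smith@example.com",
    "20 @ $10 each",
    "@@@ This is a comment @@@",
    "Jane@example.net",
    "Peter.Smith@example.net",
]
# Assumption: the file ends with a newline, so \W can match after the last address.
text = "\n".join(email_lines) + "\n"

lookahead_only = re.compile(r"\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}")
anchored = re.compile(
    r"^\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}$",
    re.MULTILINE,  # ^ and $ match at the start and end of every line
)

partial = lookahead_only.findall(text)
whole = anchored.findall(text)

print(partial)  # still includes John@somewhere.inva and Smith@example.com
print(whole)    # only the four valid addresses

valid = {"Peter@example.org", "Peter@example.info",
         "Jane@example.net", "Peter.Smith@example.net"}
sens = len(valid & set(whole)) / len(valid)
spec = len(valid & set(whole)) / len(whole)
print(sens, spec)  # 1.0 1.0
```

On this test data the anchored pattern reaches 100 percent on both measures, because a partial match such as John@somewhere.inva can no longer satisfy the $ at the end of the line.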
