Beginning Regular Expressions 2005.pdf

Chapter 9

Happily, you have now succeeded in avoiding the undesired matches on lines 3 and 6. At least on this simple test data, you have achieved 100 percent sensitivity and 100 percent specificity.

The terms sensitivity and specificity come from quantitative sciences, such as statistics and epidemiology. In those contexts, both sensitivity and specificity are expressed numerically, often as percentages. So for the preceding example, your first attempt at a regular expression pattern has a sensitivity of 100 percent, because it detects all true email addresses, and an initial specificity of 40 percent, because 6 of the 10 matches are false matches (in the sense that they are not valid email addresses). By the end of the Try It Out example, the specificity has risen to 100 percent on the test data.
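The arithmetic behind those percentages can be sketched in a few lines of Python. This is only an illustration that follows the paragraph's own usage of the two terms; the function names are hypothetical, and the count of four true email addresses is inferred from the statement that 6 of the 10 matches are false while sensitivity is 100 percent.

```python
def sensitivity(true_matches_found: int, true_items_total: int) -> float:
    """Percentage of the genuinely valid items that the pattern matched."""
    return 100.0 * true_matches_found / true_items_total


def specificity_as_used_here(true_matches_found: int, all_matches: int) -> float:
    """Percentage of all matches that are true matches (the usage in this chapter)."""
    return 100.0 * true_matches_found / all_matches


# First attempt: 10 matches, 6 of them false, so 4 true matches (inferred count).
print(sensitivity(4, 4))                 # all true addresses were detected
print(specificity_as_used_here(4, 10))   # only 4 of the 10 matches are true

# Final pattern: the 6 false matches are gone, leaving 4 matches, all true.
print(specificity_as_used_here(4, 4))
```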

Replacing Hyphens Example

This example looks at another problem that can occur if you are not careful in thinking through the meaning of a regular expression.

Assume that you have a collection of text documents that have to be converted into HTML/XHTML. This example focuses on the possible need for replacing a line of hyphens with the HTML/XHTML <hr> element to create a horizontal ruled line.

A simplified sample document, HyphenTest.txt, is used in this example:

something not much

----

a little text Fred

-------------

-Fred

A first attempt at expressing the problem definition might be as follows:

Replace any hyphens that occur with the character sequence <hr>.

However, that is too imprecise. For example, the third line would be replaced with the following:

<hr><hr><hr><hr>
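The effect of that imprecise problem definition is easy to reproduce. The following minimal sketch uses Python's re module, which is an assumption made for illustration and not a tool used in this example; the variable names are hypothetical.

```python
import re

line = "----"  # the four-hyphen line from HyphenTest.txt

# The pattern - matches a single hyphen, so every hyphen is replaced separately.
naive = re.sub(r"-", "<hr>", line)
print(naive)
```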

A more precise statement of the problem definition would be as follows:

Replace any group of consecutive hyphens with the character sequence <hr>.

Assume that you will omit the end tag of the hr element, because many Web browsers have problems if you use the empty element tag, <hr/>.

If you use the following regular expression pattern to express the idea of one or more hyphens, you can run into problems for two reasons:

-*


Sensitivity and Specificity of Regular Expressions

First, not all regular expression engines interpret that pattern correctly. The pattern -* means “Match zero or more hyphens,” which means that the occurrence of zero hyphens is a match. Therefore, the text Fred ought to match, which may not be what you expected. Why does Fred match? Because zero hyphens occur at its start, and the pattern accepts zero hyphens as a (zero-length) match.
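To see the zero-length match for yourself, here is a small sketch using Python's re module. Python is an assumption made for illustration (it is not one of the tools discussed in this example), and the variable names are hypothetical.

```python
import re

zero_or_more_hyphens = re.compile(r"-*")

# match() tries the pattern at the start of the string only.
fred = zero_or_more_hyphens.match("Fred")
print(fred is not None)    # a match object is returned, not None
print(repr(fred.group()))  # the matched text is the empty string
print(fred.span())         # the match starts and ends at position 0

# On a line of hyphens, the same pattern consumes the whole run.
hyphens = zero_or_more_hyphens.match("----")
print(repr(hyphens.group()))
```

The match on Fred succeeds even though it contains no hyphen at all; the match simply has zero length.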

OpenOffice.org Writer implements the -* pattern as you might intuitively expect: it matches only when at least one hyphen occurs, as shown in Figure 9-6. Strictly speaking, however, it ought to match on every line, because every line has zero hyphens at its beginning.

Figure 9-6

The Komodo Regular Expressions Toolkit interprets the regular expression pattern correctly; for example, it detects a match for the text Fred, as you can see in Figure 9-7.

Of course, the pattern -+ is more appropriate because you want at least one hyphen to be present before you expect a match. However, the fact that the * quantifier matches even the absence of the character or metacharacter that it refers to can cause confusion in some situations.
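The difference between the two quantifiers shows up clearly in a replacement. The sketch below uses Python's re module, which is an assumption made for illustration; the helper name replace_hyphen_runs is hypothetical, and the sample lines are taken from HyphenTest.txt as printed above.

```python
import re


def replace_hyphen_runs(text: str) -> str:
    """Replace each run of one or more consecutive hyphens with <hr>."""
    return re.sub(r"-+", "<hr>", text)


sample_lines = [
    "something not much",
    "----",
    "a little text Fred",
    "-------------",
    "-Fred",
]

for line in sample_lines:
    print(repr(line), "->", repr(replace_hyphen_runs(line)))

# With -*, the empty match before each character (and at the end) is replaced too.
star_result = re.sub(r"-*", "<hr>", "Fred")
print(repr(star_result))
```

With -+, lines that contain no hyphen are left untouched, whereas -* inserts <hr> at every position in Fred because each position holds a zero-length match.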
