Beginning Regular Expressions 2005.pdf

Documenting and Debugging Regular Expressions

You cannot safely try to match only the preceding variants of Leonardo da Vinci, because you might find phrases such as the following:

the great da Vinci

sgnr da Vinci

Sgnr da Vinci

Mr. da Vinci

You might try a very nonspecific literal pattern such as vinci, used with case-insensitive matching, to get an impression of the variety of usage in the data you are working with. Inevitably, such a nonspecific pattern will return undesired matches, for example the vinci inside invincible. In this exploratory phase, undesired matches don’t matter. What is important is that you see all likely forms of the name you want to match.
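As a rough illustration of this exploratory step, the following Python sketch applies a case-insensitive search for vinci to some sample text. The sample sentences are invented for the example, and the pattern is widened with \w* on each side (an addition to the literal pattern described above) so that the whole containing word is reported.

```python
import re

# Hypothetical sample text; real data would come from your own documents.
sample = (
    "Leonardo da Vinci painted the Mona Lisa. "
    "The great Da Vinci seemed invincible. "
    "Sgnr da vinci also kept notebooks."
)

# Deliberately nonspecific: any word containing "vinci", in any letter case.
# The surrounding \w* is an assumption added here so the full word is shown.
exploratory = re.compile(r"\w*vinci\w*", re.IGNORECASE)

matches = exploratory.findall(sample)
print(matches)
```

The list includes invincible alongside the forms of the name, which is acceptable at this stage: the aim is only to see what is in the data.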

Once you see the range of spelling variants in the data, you can start the process of designing an appropriate regular expression pattern.

Incorrect Spelling

People make spelling mistakes, sometimes even in familiar words. Unless you are very lucky, incorrect spelling will be present if you are manipulating large quantities of text. To maximize sensitivity and specificity, you need to make allowance for such misspellings, at least in important or extensive text manipulation.

To allow for misspelling, it can be helpful to use exploratory patterns such as the following:

\b\w+\s+Training

and:

Star\s+T\w+g

The former pattern will detect words that precede Training. So you might pick up variants such as Satr and Star. The latter pattern will pick up many possible misspellings of Training.
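The two exploratory patterns can be tried out with a short Python sketch like the one below. The sample text, including the misspellings Trainning and Traning, is invented for illustration.

```python
import re

# Hypothetical sample text containing misspellings of both words.
sample = (
    "Satr Training begins Monday. "
    "Star Trainning was postponed. "
    "Star Traning notes are online. "
    "Star Training is the official name."
)

# Finds whatever word precedes a correctly spelled "Training".
preceding_word = re.compile(r"\b\w+\s+Training")

# Finds "Star" followed by any word that starts with T and ends with g.
misspelled_training = re.compile(r"Star\s+T\w+g")

before_training = preceding_word.findall(sample)
after_star = misspelled_training.findall(sample)

print(before_training)
print(after_star)
```

Note that T\w+g is loose by design: it would also match an unrelated phrase such as Star Testing, so the results need to be reviewed by eye.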

Spell checking can prevent some problems but can introduce others. Recently, I saw someone post information about a book by an author whose surname was, supposedly, Debate. It wasn’t. The problem arose because a spell checker had changed a surname that it didn’t recognize into a word with which it was familiar.

Creating Test Cases

Creating test cases can be a very useful approach when you are using regular expressions on multiple documents. As mentioned in Chapter 2, it is important that you understand the data source that you are working with. The larger the number of documents or the more extensive the database that you are trying to search or manipulate, the more important it becomes that you take time to thoroughly understand the data that is being addressed using regular expressions.
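One possible way to organize test cases is sketched below in Python: each case pairs a piece of text with whether the pattern is expected to match it. The function name find_failures, the candidate pattern, and the test strings are all assumptions made for this example, not a prescribed method.

```python
import re

def find_failures(pattern, cases):
    """Return the texts whose actual match result differs from the expected one."""
    failures = []
    for text, should_match in cases:
        matched = pattern.search(text) is not None
        if matched != should_match:
            failures.append(text)
    return failures

# Hypothetical test cases: (text, whether a match is expected).
cases = [
    ("Leonardo da Vinci", True),
    ("the great da Vinci", True),
    ("Sgnr da Vinci", True),
    ("Mr. da Vinci", True),
    ("an invincible army", False),
]

# The nonspecific exploratory pattern, and a more specific candidate.
naive = re.compile(r"vinci", re.IGNORECASE)
candidate = re.compile(r"\bda\s+Vinci\b", re.IGNORECASE)

naive_failures = find_failures(naive, cases)
candidate_failures = find_failures(candidate, cases)

print(naive_failures)
print(candidate_failures)
```

Keeping the cases in one list means that each new variant discovered in the data can be added as a further pair and the pattern rechecked against all of them.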
