Beginning Regular Expressions 2005.pdf

Chapter 10

Know Your Data

When you use regular expressions, it is crucial that you understand the data you are working with.

The following sections illustrate the types of problems that may lie in your data.

Abbreviations

If you are handling large volumes of text, you may find that abbreviations for a term of interest can cause problems in matching.

Suppose that you want to locate information about Dr. Victor Smith. Among the forms you might find are:

Dr. Smith

Dr. V. Smith

Victor Smith

Doctor V. Smith

Doctor Victor Smith

Dr Victor Smith

As you can see, the title can be written as Doctor, Dr (with no period character), or Dr. (with a period character). The first name varies too: it can appear as Victor, as the initial V., or not at all.
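One way to cover the six forms listed above with a single pattern is sketched below in Python. The pattern name SMITH and the test strings are illustrative assumptions, not part of the original text. The pattern requires either a title or the first name Victor before Smith, so a bare "Smith" or "Mr. Smith" is deliberately not matched.

```python
import re

# Hypothetical pattern covering the six forms listed above.
# Either a title (Dr, Dr., Doctor) with an optional first name or initial,
# or the first name Victor on its own, must precede the surname.
SMITH = re.compile(
    r"\b(?:(?:Dr\.?|Doctor)\s+(?:(?:Victor|V\.)\s+)?|Victor\s+)Smith\b"
)

forms = [
    "Dr. Smith",
    "Dr. V. Smith",
    "Victor Smith",
    "Doctor V. Smith",
    "Doctor Victor Smith",
    "Dr Victor Smith",
]

# Keep only the forms that the pattern matches in full.
matched = [form for form in forms if SMITH.fullmatch(form)]
print(matched == forms)
```

Whether a form such as "V. Smith" (initial only, no title) should also match depends on what is actually present in your data; the sketch above leaves it out because it is not among the listed forms.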

Technical terms are also often abbreviated and raise similar issues. For example, if you wanted to find information about Microsoft’s Most Valuable Professionals, you would need to match forms such as the following:

MVP

MVPs

Most Valuable Professional
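The abbreviated and spelled-out forms above can be handled with alternation and an optional plural s. The following Python sketch is one possible approach; the name MVP_PATTERN and the sample sentence are assumptions made for illustration.

```python
import re

# Hypothetical pattern: the abbreviation or the full term,
# each with an optional plural "s", bounded by word boundaries.
MVP_PATTERN = re.compile(r"\b(?:MVPs?|Most\s+Valuable\s+Professionals?)\b")

sample = "Two MVPs spoke; one MVP is a Most Valuable Professional."

# findall returns the full matches because the group is non-capturing.
found = MVP_PATTERN.findall(sample)
print(found)
```

The word boundaries keep the pattern from matching inside longer tokens such as "MVPX".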

Proper Names

If the relevant part of the data involves proper names, whether of people, businesses, or places, variations in capitalization and spacing can cause matches to be missed.

If, for example, you are interested in the work of the famous artist Leonardo da Vinci, you might find any of the following variants in the data:

Leonardo Da Vinci

Leonardo da Vinci

Leonardo DaVinci

Leonardo daVinci

Notice the variations in case among the four examples and the variations in whether or not there is a space character before Vinci.
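A character class can absorb the case variation, and an optional whitespace quantifier can absorb the presence or absence of the space. The Python sketch below shows one way to do this; the name DA_VINCI is an assumption made for illustration.

```python
import re

# Hypothetical pattern: [Dd] accepts either case of the particle,
# and \s* accepts zero or more whitespace characters before "Vinci".
DA_VINCI = re.compile(r"Leonardo\s+[Dd]a\s*Vinci")

variants = [
    "Leonardo Da Vinci",
    "Leonardo da Vinci",
    "Leonardo DaVinci",
    "Leonardo daVinci",
]

# Keep only the variants that the pattern matches in full.
matched_variants = [v for v in variants if DA_VINCI.fullmatch(v)]
print(matched_variants == variants)
```

If the data may also contain forms such as "leonardo da vinci", compiling the pattern with re.IGNORECASE is an alternative, at the cost of being less strict about capitalization.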
