Beginning Regular Expressions 2005.pdf

Chapter 6

What Is a Word?

The notion of what constitutes a word might seem, at first sight, to be obvious. But if you were asked to say which of the sequences of characters on the following lines were words, what would your answer be? And what criteria would you use to arrive at your opinion?

cat ja jar Nein parr smolt pomme Claire spil

How many words did you identify? Probably cat and jar were on your list. But what about ja and Nein? If you know German, you almost certainly would classify both those as words. Similarly, if you know French, you would identify pomme as a word, but if you had no knowledge of French, you might take the opposite view. And an English speaker familiar with the life cycle of the Atlantic salmon would have no difficulty in identifying parr and smolt as stages in that life cycle, but some other English speakers might not be familiar with those words.

Clearly, it isn’t realistic to expect a text processor to have knowledge about what is or isn’t a word in English, French, German, or any of a host of other languages. Similarly, you can’t expect a text processor to have knowledge in all technical areas. So you need another technique — a more mechanistic technique — to allow identification of word boundaries.

Identifying Word Boundaries

A word boundary can be viewed as two positions: one at the beginning of a sequence of characters that form a word and one at the end of a sequence of characters that form a word.

Depending on which tools or languages you use, there are metacharacters that match a word-boundary position occurring at the beginning of a word, a word-boundary position occurring at the end of a word, or both.
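As a rough illustration of these two kinds of position, the following Python sketch lists where words begin and end in a short string. It rests on assumptions not taken from the text above: Python's re module has no separate start-of-word and end-of-word metacharacters, so the sketch combines the general word-boundary metacharacter \b with a lookahead or lookbehind. Python's \w also counts digits and the underscore as word characters, which is broader than "alphabetic character". The function name boundary_positions is hypothetical.

```python
import re

def boundary_positions(text):
    """Return (start_of_word, end_of_word) offsets found in text."""
    # \b(?=\w): a word boundary that has a word character after it,
    # that is, a position at the beginning of a word.
    starts = [m.start() for m in re.finditer(r"\b(?=\w)", text)]
    # \b(?<=\w): a word boundary that has a word character before it,
    # that is, a position at the end of a word.
    ends = [m.start() for m in re.finditer(r"\b(?<=\w)", text)]
    return starts, ends

starts, ends = boundary_positions("ABC DEF GHI")
print(starts)  # [0, 4, 8]
print(ends)    # [3, 7, 11]
```

Each of the three words contributes one beginning-of-word position and one end-of-word position; \b on its own would match all six.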

The \< Syntax

The \< metacharacter identifies a word-boundary position occurring at the beginning of a word. The position it matches is either preceded by a character that is not an alphabetic character (for example, a space character) or is itself a beginning-of-line position.

A simple sample file, BoundaryTest.txt, is shown here:

ABC DEF GHI

GHI ABC DEF

ABC DEF GHI

CAB CBA AAA

String, Line, and Word Boundaries

The problem definition is as follows:

Match an uppercase A when it occurs immediately following a word boundary.

In other words, match an uppercase A when it is preceded by a nonword character or by a start-of-string or start-of-line position.

Try It Out

Matching a Beginning-of-Word Word Boundary

1. Open OpenOffice.org Writer, and open the file BoundaryTest.txt.

2. Open the Find & Replace dialog box using the Ctrl+F keyboard shortcut, and check the Regular Expressions and Match Case check boxes.

3. Enter the pattern \<A in the Search For text box.

4. Click the Find All button, and inspect the results, as shown in Figure 6-16.

Figure 6-16

How It Works

On the first line, the A of ABC follows the start-of-text position, so there is a match.

On the second line, the A of ABC follows a space character (which is a nonword character), so there is a match.

On the third line, the A of ABC follows a start-of-line position, so there is a match for the pattern \<A.

On the final line, the A of CAB has an alphabetic character before it, so the pattern \<A does not match. The A of CBA is followed by a nonword character but is preceded by an alphabetic character, so the pattern \<A does not match.

The first A of AAA is preceded by a nonword character, so the pattern \<A matches. However, the second and third A of AAA are each preceded by an alphabetic character, so they do not match.
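The same walk-through can be approximated outside OpenOffice.org Writer. The sketch below is a Python approximation, not the tool used in this exercise: Python's re module does not support \<, so it assumes that \bA is an acceptable stand-in. Because A is itself a word character, a word boundary immediately before it can only be a beginning-of-word position. Matching is case sensitive by default, which corresponds to checking Match Case. The function name find_start_of_word_a is hypothetical.

```python
import re

# Lines of BoundaryTest.txt as listed earlier in this section.
LINES = ["ABC DEF GHI", "GHI ABC DEF", "ABC DEF GHI", "CAB CBA AAA"]

# Stand-in for \<A: a word boundary followed by an uppercase A.
START_OF_WORD_A = re.compile(r"\bA")

def find_start_of_word_a(line):
    """Return the offsets of every A that begins a word in line."""
    return [m.start() for m in START_OF_WORD_A.finditer(line)]

for line in LINES:
    print(line, find_start_of_word_a(line))
# ABC DEF GHI [0]
# GHI ABC DEF [4]
# ABC DEF GHI [0]
# CAB CBA AAA [8]
```

Only one A is reported for the final line: the first A of AAA. The A of CAB, the A of CBA, and the second and third A of AAA all follow a word character.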

The \> Syntax

The \> metacharacter signifies a word boundary that occurs at the end of a sequence of word characters. In other words, it matches a word boundary that occurs at the end of a word.

The test file, EndBoundary.txt, is shown here:

Theodore said “This is a lathe!”

I shaved today and my new shaving cream made a good lather.

A lathe is a tool for turning wood or metal.

The Thespian Theatre is something I am loathe to attend.

The quick brown fox jumped over the lazy dog.

The task is to match the sequence of characters the when it occurs immediately before a word boundary at the end of a word.

Try It Out

The \> Metacharacter

1. Open OpenOffice.org Writer, and open the file EndBoundary.txt.

2. Open the Find & Replace dialog box using the Ctrl+F keyboard shortcut, and check the Regular Expressions check box, but do not check the Match Case check box, because you want a case-insensitive search on this occasion.

3. Enter the pattern the\> in the Search For text box.

4. Click the Find All button, and inspect the results, as shown in Figure 6-17.

Figure 6-17

How It Works

The matching sequence of characters the in the first line comes immediately before an exclamation mark, which is not an alphabetic character. Therefore, the sequence of characters the precedes a word boundary and matches the pattern the\>.

On the second line, the sequence of characters the in lather is followed by an alphabetic character, a lowercase r. There is no word boundary after the sequence of characters the and, thus, no match.

The matching occurrences later in the test file each precede a space character, which is a nonword character, so each of those sequences of characters precedes an end-of-word word boundary. Each matches the pattern the\>.
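A comparable check can be sketched in Python, again as an approximation rather than the tool used in this exercise. The sketch assumes that the\b with the re.IGNORECASE flag is an acceptable stand-in for the\> with Match Case unchecked, because Python's re module does not support \>. Since e is a word character, a boundary immediately after it can only be an end-of-word position. The first test line is written with the closing exclamation mark that the explanation above describes, and the function name count_the_at_word_end is hypothetical.

```python
import re

# Lines of EndBoundary.txt. The first line is assumed to end with an
# exclamation mark and closing quote, as the explanation describes.
LINES = [
    "Theodore said “This is a lathe!”",
    "I shaved today and my new shaving cream made a good lather.",
    "A lathe is a tool for turning wood or metal.",
    "The Thespian Theatre is something I am loathe to attend.",
    "The quick brown fox jumped over the lazy dog.",
]

# Stand-in for the\>: the letters "the" followed by a word boundary,
# matched without regard to case.
THE_AT_WORD_END = re.compile(r"the\b", re.IGNORECASE)

def count_the_at_word_end(line):
    """Return how many times 'the' ends a word in line."""
    return len(THE_AT_WORD_END.findall(line))

for line in LINES:
    print(count_the_at_word_end(line), line)
```

The counts are 1, 0, 1, 2, and 2. Theodore, lather, Thespian, and Theatre contribute nothing, because in each of them the sequence the is followed by another word character.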
