
Text Processing

4. To find out some interesting things about the directory, or any file, use os.stat:

>>> os.stat('.')

(16895, 0, 2, 1, 0, 0, 0, 1112654581, 1097009078, 1019063193)

Note that the directory named ‘.’ is shorthand for the current directory.

5. If you actually want to list the files in the directory, do this:

>>> os.listdir('.')

['.javaws', '.limewire', 'Application Data', 'Cookies', 'Desktop', 'Favorites', 'gsview32.ini', 'Local Settings', 'My Documents', 'myfile.txt', 'NetHood', 'NTUSER.DAT', 'ntuser.dat.LOG', 'ntuser.ini', 'PrintHood', 'PUTTY.RND', 'Recent', 'SendTo', 'Start Menu', 'Templates', 'UserData', 'WINDOWS']

How It Works

Most of that was perfectly straightforward and easy to understand, but let’s look at a couple of points before going on and writing a complete script or two.

First, you can easily see how you might construct an iterating script using listdir, split, and stat — but you don't have to, because os.path provides the walk function to do just that, as you'll see later. The walk function not only saves you the time and effort of writing and debugging an iterative algorithm that searches everything in your own way, but it also runs a bit faster, because it's built into Python and written in C, which speeds things up in cases like this. You will seldom want to write iterators of your own in Python when something built-in already does the same job.
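For comparison, here is a minimal sketch (not one of the chapter's examples) of the kind of hand-rolled iteration that os.path.walk saves you from writing; it simply recurses into each subdirectory itself using os.listdir and os.path.isdir:

import os, os.path

def visit_tree (dir):
    # Print every entry below dir, recursing into subdirectories by hand.
    for name in os.listdir (dir):
        path = os.path.join (dir, name)
        print path
        if os.path.isdir (path):
            visit_tree (path)

visit_tree ('.')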

Second, note that the output of the stat call, which comes from a system call, is pretty opaque. The tuple it returns corresponds to the structure returned from the POSIX C library function of the same name, and its component values are described in the preceding table; and, of course, in the Python documentation. The stat function really does tell you nearly anything you might want to know about a file or directory, so it’s a valuable function to understand for when you’ll need it, even though it’s a bit daunting at first glance.
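If the bare tuple is hard to read, note that in recent Python versions the stat result also carries named attributes (st_size, st_mtime, and so on), and the standard stat and time modules help decode the values. A quick interpreter sketch follows; the exact numbers and date will of course differ on your machine:

>>> import os, stat, time
>>> info = os.stat('.')
>>> info[stat.ST_MTIME] == info.st_mtime    # same value, by index or by name
True
>>> time.ctime(info.st_mtime)               # modification time, human-readable
'Mon Apr  4 16:43:01 2005'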

Try It Out

Searching for Files of a Particular Type

If you have worked with any other programming languages, you'll like how easy searching for files is with Python. Whether or not you've done this before in another language, you'll notice how short the example script is for this type of work. The following example uses the os and os.path modules to search for PDF files in the current directory — wherever you are when you call the function. On a Unix or Linux system, you could do this from the command line with, for example, the Unix find command. However, unless you do that often, each time you wanted to look for files you'd need to figure out the command-line syntax for find yet again. (Because find does so much, that can be difficult, and the difficulty is compounded by the fact that it expects you to be familiar with how it works already!) Another advantage of doing this in Python is that you can refine your script to do special things based on what it finds, and as you discover new uses for the program, you can add new features to find files in the ways you need. For instance, a search may return far too many results to look at; you can refine your Python script to winnow the results down to just what you need.


This is a great opportunity to show off the nifty os.path.walk function, so that’s the basis of this script. This function is great because it will do all the heavy lifting of file system iteration for you, leaving you to write a simple function to do something with whatever it finds along the way:

1. Using your favorite text editor, open a script called scan_pdf.py in the directory you want to scan for PDFs and enter the following code:

import os, os.path
import re

def print_pdf (arg, dir, files):
    for file in files:
        path = os.path.join (dir, file)
        path = os.path.normcase (path)
        if re.search (r".*\.pdf", path):
            print path

os.path.walk ('.', print_pdf, 0)

2. Run it. Obviously, the following output will not match yours. For the best results, add a bunch of files that end in .pdf to this directory!

$ python scan_pdf.py

.\95-04.pdf

.\non-disclosure agreement 051702.pdf

.\word pro - dokument in lotus word pro 9 dokument45.pdf

.\101translations\2003121803\2003121803.pdf

.\101translations\2004101810\scan.pdf

.\bluemangos\purchase order - michael roberts smb-pt134.pdf

.\bluemangos\smb_pt134.pdf

.\businessteam.hu\aok.pdf

.\businessteam.hu\chn14300-2.pdf

.\businessteam.hu\diplom_bardelmeier.pdf

.\businessteam.hu\doktor_bardelmeier.pdf

.\businessteam.hu\finanzamt_1.pdf

.\businessteam.hu\zollbescheinigung.pdf

.\businessteam.hu\monday\s3.pdf

.\businessteam.hu\monday\s4.pdf

.\businessteam.hu\monday\s5.pdf

.\gerard\done\tg82-20nc-md-04.07.pdf

.\gerard\polytronic\iau-reglement_2005.pdf

.\gerard\polytronic\tg82-20bes user manual\tg82-20bes-md-27.05.pdf

.\glossa\neumag\de_993_ba_s5.pdf

.\glossa\pepperl+fuchs\5626eng3con\vocab - 3522a_recom_flsd.pdf

.\glossa\pepperl+fuchs\5769eng4\5769eng4 - td4726_8400 d-e - 16.02.04.pdf

How It Works

This is a nice little script, isn’t it? Python does all the work, and you get a list of the PDFs in your directories, including their location and their full names — even with spaces, which can be difficult to deal with under Unix and Linux.


A little extra work with the paths has been done so that it's easier to see what's where: a call to os.path.join builds the full (relative) pathname of each PDF from the starting directory, and a call to os.path.normcase makes sure that all the filenames are lowercase under Windows. Under Unix, normcase has no effect, because case is significant there and you don't want to change the capitalization (and it doesn't); under Windows, having every name in lowercase makes it easier to check whether a filename ends in .pdf.
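If you want to see exactly what those two calls do, try them by hand in the interpreter. On Windows you might see something like the following (the filename is made up for the example; on Unix, normcase simply returns its argument unchanged):

>>> import os.path
>>> os.path.join('.', 'My Documents', 'Report.PDF')
'.\\My Documents\\Report.PDF'
>>> os.path.normcase('.\\My Documents\\Report.PDF')
'.\\my documents\\report.pdf'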

Note the use of a very simple regular expression to check the ending of the filename. You could also have used os.path.splitext to get a tuple with the file's base name and its extension, and compared the extension to pdf, which arguably would have been cleaner. However, because this script is effectively laid out as a filter, starting it with a regular expression (also called a regexp) comparison makes sense. Doing it this way means that if you decide later to restrict the output further, perhaps by adding more filters based on needs you discover, you can just add more regexp comparisons and keep the tests easy to read. This is more a question of taste than anything else. (It was also a good excuse to work in a first look at regular expressions and to demonstrate that they're really not too hard to understand.)
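For the record, a version of the test built on os.path.splitext instead of a regular expression might look like the following sketch; it isn't the approach the chapter takes, but it finds the same files:

import os, os.path

def print_pdf (arg, dir, files):
    for file in files:
        path = os.path.normcase (os.path.join (dir, file))
        base, ext = os.path.splitext (path)   # splitext keeps the dot: ('scan', '.pdf')
        if ext != '.pdf':
            continue
        print path

os.path.walk ('.', print_pdf, 0)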

If you haven't seen it before, the form r"<string constant>" simply tells Python that the string constant should suppress all special processing for backslash values. Thus, while "\n" is a string one character in length containing a newline, r"\n" is a string two characters in length, containing a backslash character followed by the letter 'n'. Because regular expressions tend to contain a lot of backslashes, it's very convenient to be able to suppress their special meaning with this switch.
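You can see the difference for yourself in the interpreter; the string lengths tell the whole story:

>>> len("\n")      # one character: a newline
1
>>> len(r"\n")     # two characters: a backslash and the letter n
2
>>> r"\n"
'\\n'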

Try It Out

Refining a Search

As it turned out, there were few enough PDF files (about 100) in the example search results that you could find the files you were looking for simply by reading through the list; but very often when doing a search of this kind, you look at the results of the first pass and then use that knowledge to zero in on what you ultimately need. Zeroing in means trying out the script, noticing where its results could be better, and making successive changes to it so that it finds the information you want.

To get a flavor of that kind of successive, or iterative, programming, assume that instead of just showing all the PDFs, you now want to exclude all PDFs with a space in the name. For example, because the files you were looking for were downloaded from web sites, they wouldn't have spaces in their names, whereas many of the files you received in e-mail messages were attachments from someone else's file system and therefore often did. This refinement is therefore a very likely one for you to have an opportunity to use:

1. Using your favorite text editor again, open scan_pdf.py and change it to look like the following (the changed portion is the pair of if tests inside the loop; or, if you skipped the last example, just enter the entire code as follows):

import os, os.path
import re

def print_pdf (arg, dir, files):
    for file in files:
        path = os.path.join (dir, file)
        path = os.path.normcase (path)
        if not re.search (r".*\.pdf", path): continue
        if re.search (r" ", path): continue
        print path

os.path.walk ('.', print_pdf, 0)

2. Now run the modified script — and again, this output will not match yours:

$ python scan_pdf.py

.\95-04.pdf

.\101translations\2003121803\2003121803.pdf

.\101translations\2004101810\scan.pdf

.\bluemangos\smb_pt134.pdf

.\businessteam.hu\aok.pdf

.\businessteam.hu\chn14300-2.pdf

.\businessteam.hu\diplom_bardelmeier.pdf

.\businessteam.hu\doktor_bardelmeier.pdf

.\businessteam.hu\finanzamt_1.pdf

.\businessteam.hu\zollbescheinigung.pdf

.\businessteam.hu\monday\s3.pdf

.\businessteam.hu\monday\s4.pdf

.\businessteam.hu\monday\s5.pdf

.\gerard\done\tg82-20nc-md-04.07.pdf

.\gerard\polytronic\iau-reglement_2005.pdf

.\glossa\neumag\de_993_ba_s5.pdf

How It Works

There's a stylistic change in this code — one that works well for these quick, text-processing-oriented filter scripts. Look at the print_pdf function: it first builds and normalizes the pathname and then runs tests on it to ensure that it's one you want. When a test fails, it uses continue to skip to the next file in the list. This technique lets a whole series of tests be performed one after another while keeping the code easy to read.

Working with Regular Expressions and the re Module

Perhaps the most powerful tool in the text processing toolbox is the regular expression. Matching on simple strings or substrings is useful, but it's limited. Regular expressions pack a lot of punch into a few characters, and they're so powerful that it really pays to get to know them. The basic regular expression syntax is used identically in several programming languages, and you can find at least one book written solely on their use, as well as thousands of pages in other books (like this one).


As mentioned previously, a regular expression defines a simple parser that matches strings within a text. Regular expressions work essentially the same way as wildcards when you use them to specify multiple files on a command line, in that a wildcard lets you define a pattern that matches many different possible filenames. In case you didn't know what they were, characters like * and ? are wildcards that, when used with commands such as dir on Windows or ls on Unix, let you select more than one file, but possibly fewer than every file (as does dir win*, which lists only the files in your directory on Windows whose names start with the letters w, i, and n followed by anything — that's why the * is called a wildcard). There are two major differences between a regular expression and a simple wildcard:

A regular expression can match multiple times anywhere in a longer string.

Regular expressions are much, much more complicated and much richer than simple wildcards, as you will see.
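If you are curious just how close the two are, the standard library's fnmatch module (not otherwise used in this chapter) can translate a shell-style wildcard into the equivalent regular expression; the exact text it returns varies a little between Python versions:

>>> import fnmatch
>>> fnmatch.translate('win*')
'win.*$'
>>> fnmatch.fnmatch('winword.exe', 'win*')
True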

The main thing to note when starting to learn about regular expressions is this: A string always matches itself. Therefore, for instance, the pattern ‘xxx’ will always match itself in ‘abcxxxabc’. Everything else is just icing on the cake; the core of what we’re doing is just finding strings in other strings.

You can add special characters to make the patterns match more interesting things. The most commonly used one is the general wildcard ‘.’ (a period, or dot). The dot matches any one character in a string; so, for instance, ‘x.x’ will match the strings ‘xxx’ or ‘xyx’ or even ‘x.x’.

The last example raises a fundamental point in dealing with regular expressions. What if you really only want to find something with a dot in it, like ‘x.x’? Specifying ‘x.x’ as a pattern won’t work, because it will also match ‘x!x’ and ‘xqx’. Instead, regular expressions let you escape special characters by putting a backslash in front of them. Therefore, to match ‘x.x’ and only ‘x.x’, you would use the pattern ‘x\.x’; escaping the period takes away its special meaning.

However, here you run into a problem with Python’s normal processing of strings. Python also uses the backslash for escape sequences: ‘\n’ specifies a newline and ‘\t’ is a tab character. To avoid running afoul of this normal processing, regular expressions are usually specified as raw strings, which, as you’ve seen, is a fancy way of saying that you tack an ‘r’ onto the front of the string constant so that Python leaves the backslashes alone.

So after all that verbiage, how do you really match ‘x.x’? Simple: You specify the pattern r”x\.x”. Fortunately, if you’ve gotten this far, you’ve already made it through the hardest part of coming to grips with regular expressions in Python. The rest is easy.
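If you would rather not work out which characters need a backslash, the re module also provides re.escape, which adds the backslashes for you; it isn't used in this chapter's examples, but it can save some head-scratching:

>>> import re
>>> print re.escape('x.x')
x\.x
>>> re.search(re.escape('x.x'), 'axqxa')      # no match now; the dot is literal
>>> re.search(re.escape('x.x'), 'ax.xa').group()
'x.x'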

Before you get too far into specifying the many special characters used by regular expressions, first look at the function used to match strings, and then do some learning by example, by typing a few regular expressions right into the interpreter.


Try It Out

Fun with Regular Expressions

This exercise uses some functional programming tools that you may have seen before but perhaps not had an opportunity to use yet. The idea is to apply a regular expression to a bunch of different strings and see which ones it matches and which ones it doesn’t. To do this in one line of typing, you can use the filter function; but because filter applies a function of one argument to each member of its input sequence, and re.match and re.search take two arguments, you’re forced to use either a named function or an anonymous lambda form (as in this example). Don’t think too hard about it (you can return to Chapter 9 to review how this works); what it’s doing will be obvious:

1. Start the Python interpreter and import the re module:

$ python

>>> import re

2. Now define a list of interesting-looking strings to filter with various regular expressions:

>>> s = ('xxx', 'abcxxxabc', 'xyx', 'abc', 'x.x', 'axa', 'axxxxa', 'axxya')

3. Do the simplest of all regular expressions first:

>>> filter ((lambda s: re.match(r"xxx", s)), s)

('xxx',)

4. Hey, wait! Why didn’t that find ‘axxxxa’, too? Even though you normally talk about matches inside the string, in Python the re.match function looks for matches only at the start of its input. To find strings anywhere in the input, use re.search (which spells the word research, so it’s cooler and easy to remember anyway):

>>> filter ((lambda s: re.search(r"xxx", s)), s)

('xxx', 'abcxxxabc', 'axxxxa')

5. OK, look for that period:

>>> filter ((lambda s: re.search(r"x.x", s)), s)
('xxx', 'abcxxxabc', 'xyx', 'x.x', 'axxxxa')

6. Here’s how you match only the period (by escaping the special character):

>>> filter ((lambda s: re.search(r"x\.x", s)), s)
('x.x',)

7. You also can search for any number of characters between the x’s by using the asterisk, which matches a series of whatever is in front of it:

>>> filter ((lambda s: re.search(r"x.*x", s)), s)

('xxx', 'abcxxxabc', 'xyx', 'x.x', 'axxxxa', 'axxya')


8. Wait a minute! How did ‘x.*x’ match ‘axxya’ if there was nothing between the two x’s? The secret is that the asterisk is tricky — it matches zero or more occurrences of the character before it, so ‘.*’ is happy to match nothing at all between the two x’s. If you really want to make sure something is between the x’s, use a plus instead, which matches one or more characters:

>>> filter ((lambda s: re.search(r"x.+x", s)), s)

('xxx', 'abcxxxabc', 'xyx', 'x.x', 'axxxxa')

9. Now you know how to match anything with, say, a ‘c’ in it:

>>> filter ((lambda s: re.search(r"c+", s)), s)
('abcxxxabc', 'abc')

10. Here’s where things get really interesting: How would you match anything without a ‘c’? Regular expressions use square brackets to denote special sets of characters to match, and if there’s a caret at the beginning of the set, it means all characters that don’t appear in the set, so your first idea might be to try this:

>>> filter ((lambda s: re.search(r"[^c]*", s)), s)

('xxx', 'abcxxxabc', 'xyx', 'abc', 'x.x', 'axa', 'axxxxa', 'axxya')

11. That matched the whole list. Why? Because the asterisk means zero or more, the pattern ‘[^c]*’ can match an empty string in any input, so every string gets through — you negated the wrong thing. To make this clearer, you can filter a list with more c’s in it:

>>> filter ((lambda s: re.search(r"[^c]*", s)), ('c', 'cc', 'ccx'))

('c', 'cc', 'ccx')

Note that older versions of Python may return a different tuple, (‘ccx’,).

12. To really match anything without a ‘c’ in it, you have to use the ^ and $ special characters to refer to the beginning and end of the string, and then tell re that you want strings composed only of non-c characters from beginning to end:

>>> filter ((lambda s: re.search(r"^[^c]*$", s)), s)

('xxx', 'xyx', 'x.x', 'axa', 'axxxxa', 'axxya')

As you can see from the last example, getting re to understand what you mean can sometimes require a little effort. It’s often best to try out new regular expressions on a bunch of data you understand and then check the results carefully to ensure that you’re getting what you intended; otherwise, you can get some real surprises later!

Use the techniques shown here in the following example: you can always run the Python interpreter in interactive mode and test your regular expression with sample data until it matches what you want.
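One way to make that interactive testing a little quicker is a throwaway helper along the following lines; the function name is just an example, not something from the re module:

import re

def try_pattern (pattern, samples):
    # Report which of the sample strings the pattern matches.
    for s in samples:
        if re.search (pattern, s):
            print pattern, 'matches', s
        else:
            print pattern, 'does not match', s

try_pattern (r"x\.x", ('xxx', 'x.x', 'xqx'))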

Try It Out

Adding Tests

The example scan_pdf.py scripts shown so far provide a nicely formatted framework for testing files. As mentioned previously, the os.path.walk function provides the heavy lifting. The print_pdf function you write performs the tests — in this case, looking for PDF files.


Clocking in at less than 20 lines of code, these examples show the true power of Python. Following the structure of the print_pdf function, you can easily add tests to refine the search, as shown in the following example:

1. Using your favorite text editor again, open scan_pdf.py and change it to look like the following. The changed portion is the second if test inside the loop (or, if you skipped the last example, just enter the entire code that follows):

import os, os.path
import re

def print_pdf (arg, dir, files):
    for file in files:
        path = os.path.join (dir, file)
        path = os.path.normcase (path)
        if not re.search (r".*\.pdf", path): continue
        if not re.search (r".\.hu", path): continue
        print path

os.path.walk ('.', print_pdf, 0)

2. Now run the modified script — and again, this output will not match yours:

C:\projects\translation>python scan_pdf.py

.\businessteam.hu\aok.pdf

.\businessteam.hu\chn14300-2.pdf

.\businessteam.hu\diplom_bardelmeier.pdf

.\businessteam.hu\doktor_bardelmeier.pdf

.\businessteam.hu\finanzamt_1.pdf

.\businessteam.hu\zollbescheinigung.pdf

.\businessteam.hu\monday\s3.pdf

.\businessteam.hu\monday\s4.pdf

.\businessteam.hu\monday\s5.pdf

...

How It Works

This example follows the structure set up in the previous examples and adds another test. You can add test after test to create the script that best meets your needs.

In this example, the test looks only for filenames (which include the full paths) with a .hu in the name. The assumption here is that files with a .hu in the name (or in a directory with .hu in the name) are translations from Hungarian (hu is the two-letter country code for Hungary). Therefore, this example shows how to narrow the search to files translated from Hungarian. (In real life, you will obviously require different search criteria. Just add the tests you need.)

You can continue refining your script to create a generalized search utility in Python. Chapter 12 goes into this in more depth.
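As a hint of what that generalization might look like, note that the third argument to os.path.walk (which these examples have simply set to 0) is handed to your function as its first parameter, so it can carry the pattern you want to test for. The sketch below is one possible direction, not the book's code:

import os, os.path
import re

def print_matches (pattern, dir, files):
    # pattern arrives here as the value passed to os.path.walk below
    for file in files:
        path = os.path.normcase (os.path.join (dir, file))
        if not re.search (pattern, path):
            continue
        print path

# The same function can now drive several different searches.
os.path.walk ('.', print_matches, r".*\.pdf$")
os.path.walk ('.', print_matches, r".*\.txt$")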


Summary

Text processing scripts are generally short, useful, reusable programs, which are either written for one-time or occasional use, or used as components of a larger data-processing system. The chief tools for the text processing programmer are directory structure navigation and regular expressions, both of which were examined briefly in this chapter.

Python is handy for this style of programming because it strikes a balance: it is easy to use for simple, one-time tasks, and it is also structured enough to ease the maintenance of code that gets reused over time.

The specific techniques shown in this chapter include the following:

Use the os.path.walk function to traverse the file system.

Place the search criteria in the function you write and pass it to the os.path.walk function.

Regular expressions work well to perform the tests on each file found by the os.path.walk function.

Try out regular expressions in the Python interpreter interactively to ensure they work.

Chapter 12 covers an important concept: testing. Testing enables you not only to ensure that your scripts work but that the scripts still work when you make a change.

Exercises

1. Modify the scan_pdf.py script to start at the root, or topmost, directory. On Windows, this should be the topmost directory of the current disk (C:, D:, and so on). Doing this on a network share can be slow, so don't be surprised if your G: drive takes a lot more time when it comes from a file server. On Unix and Linux, this should be the topmost directory (the root directory, /).

2. Modify the scan_pdf.py script to match only PDF files with the text boobah in the filename.

3. Modify the scan_pdf.py script to exclude all files with the text boobah in the filename.
