Beginning Python (2005)
Building a Module
In most cases, you’ll want to place your Python modules in the site-packages directory. Look in the sys.path listing and find a directory name ending in site-packages. This is a directory for packages installed at a site that are not part of the Python standard library of packages.
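Rather than reading through the sys.path listing by eye, you can filter it. The following is a minimal sketch; the helper name site_packages_dirs is made up for this illustration and is not part of Python.

```python
import sys


def site_packages_dirs(paths=None):
    """Return the entries of paths (default: sys.path) that end in site-packages."""
    if paths is None:
        paths = sys.path
    # Strip trailing separators so an entry written with a final slash still matches.
    return [p for p in paths if p.rstrip("/\\").endswith("site-packages")]


if __name__ == "__main__":
    for directory in site_packages_dirs():
        print(directory)
```

On some installations the list may be empty or contain several entries, so treat the result as a hint rather than a definitive answer.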
In addition to modules, you can create packages of modules, a set of related modules that install into the same directory structure. See the Python documentation at http://docs.python.org for more on this subject.
You can install your modules using one of three mechanisms:
You can do everything by hand and manually create an installation script or program.
You can create an installer specific to your operating system, such as MSI files on Windows, an RPM file on Linux, or a DMG file on Mac OS X.
You can use the handy Python distutils package, short for distribution utilities, to create a Python-based installer.
To use the Python distutils, you need to create a setup script, named setup.py. A minimal setup script can include the following:
from distutils.core import setup

setup(name='NameOfModule',
      version='1.0',
      py_modules=['NameOfModule'],
      )
You need to include the name of the module twice. Replace NameOfModule with the name of your module, such as meal in the examples in this chapter.
Name the script setup.py.
After you have created the setup.py script, you can create a distribution of your module using the following command:
python setup.py sdist
The argument sdist is short for source distribution. You can try this out with the following example.
Try It Out
Creating an Installable Package
Enter the following script and name the file setup.py:
from distutils.core import setup

setup(name='meal',
      version='1.0',
      py_modules=['meal'],
      )
Chapter 10
Run the following command to create a Python module distribution:
$ python setup.py sdist
running sdist
warning: sdist: missing required meta-data: url
warning: sdist: missing meta-data: either (author and author_email) or (maintainer and maintainer_email) must be supplied
warning: sdist: manifest template 'MANIFEST.in' does not exist (using default file list)
warning: sdist: standard file not found: should have one of README, README.txt
writing manifest file 'MANIFEST'
creating meal-1.0
making hard links in meal-1.0...
hard linking meal.py -> meal-1.0
hard linking setup.py -> meal-1.0
creating dist
tar -cf dist/meal-1.0.tar meal-1.0
gzip -f9 dist/meal-1.0.tar
removing 'meal-1.0' (and everything under it)
How It Works
Notice all the complaints. The setup.py script was clearly not complete. It included enough to create the distribution, but not enough to satisfy the Python conventions. When the setup.py script completes, you should see the following files in the current directory:
$ ls
MANIFEST dist/ meal.py setup.py
The setup.py script created the dist directory and the MANIFEST file. The dist directory contains one file, a compressed version of our module:
$ ls dist
meal-1.0.tar.gz
You now have a one-file distribution of your module, which is kind of silly because the module itself was just one file. The advantage of distutils is that your module will be properly installed.
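The first two warnings in the sdist output name the meta-data that was missing: a url, and either an author with author_email or a maintainer with maintainer_email. The sketch below mirrors that check on a plain dictionary of setup() keyword arguments. The helper missing_metadata and all of the values in fuller are placeholders invented for this illustration; the helper is not a distutils function.

```python
def missing_metadata(options):
    """Return the meta-data fields that the sdist warnings would complain about.

    Hypothetical helper for illustration only; it is not part of distutils.
    """
    problems = []
    if not options.get("url"):
        problems.append("url")
    has_author = options.get("author") and options.get("author_email")
    has_maintainer = options.get("maintainer") and options.get("maintainer_email")
    if not (has_author or has_maintainer):
        problems.append("author/author_email or maintainer/maintainer_email")
    return problems


# The keyword arguments used by the minimal setup.py in this chapter.
minimal = {"name": "meal", "version": "1.0", "py_modules": ["meal"]}

# Placeholder values; substitute your own details.
fuller = dict(minimal,
              url="http://www.example.com/meal",
              author="Your Name",
              author_email="you@example.com")

print(missing_metadata(minimal))
print(missing_metadata(fuller))
```

Passing the same extra keywords to setup() in setup.py should quiet those two warnings. Adding a README file next to setup.py addresses the "standard file not found" warning.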
You can then take the meal-1.0.tar.gz file to another system and install the module. First, uncompress and expand the bundle. On Linux, Unix, and Mac OS X, use the following commands:
$ gunzip meal-1.0.tar.gz
$ tar xvf meal-1.0.tar
meal-1.0/
meal-1.0/meal.py
meal-1.0/PKG-INFO
meal-1.0/setup.py
On Windows, use a compression program such as WinZip, which can handle the .tar.gz files.
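If no such program is at hand, Python's own tarfile module can expand the bundle on any platform. The following is a sketch under a few assumptions: it uses the with statement, so it needs a more recent Python than the one shown in this chapter's output, the helper name unpack is invented here, and the small archive it builds is only a stand-in for the real meal-1.0.tar.gz.

```python
import os
import tarfile
import tempfile


def unpack(archive_path, destination):
    """Extract a .tar.gz archive into destination and return the member names."""
    if not os.path.isdir(destination):
        os.makedirs(destination)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(destination)
    return names


# Self-contained demonstration: build a tiny stand-in archive, then unpack it.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "meal.py")
with open(source, "w") as handle:
    handle.write('"""Stand-in module."""\n')

archive_path = os.path.join(workdir, "meal-1.0.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(source, arcname="meal-1.0/meal.py")

unpacked = unpack(archive_path, os.path.join(workdir, "unpacked"))
print(unpacked)
```

Only unpack archives that come from a source you trust, because an archive can contain paths that write outside the destination directory.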
You can install the module after it is expanded with the following command:
python setup.py install
For example:
$ python setup.py install
running install
running build
running build_py
creating build
creating build/lib
copying meal.py -> build/lib
running install_lib
copying build/lib/meal.py -> /System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages
byte-compiling /System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/meal.py to meal.pyc
The neat thing about the distutils is that it works for just about any Python module. The installation command is the same, so you just need to know one command to install Python modules on any system.
Another neat thing is that, once the module is installed, its documentation is viewable with the pydoc command, which builds the pages from the docstrings in the module. For example, the following shows the first page of documentation on the meal module:
$ pydoc meal
Help on module meal:
NAME
meal - Module for making meals in Python.
FILE
/Users/ericfj/writing/python/inst2/meal-1.0/meal.py
DESCRIPTION
Import this module and then call makeBreakfast(), makeDinner() or makeLunch().
CLASSES
    exceptions.Exception
        SensitiveArtistException
            AngryChefException
    Meal
        Breakfast
        Dinner
        Lunch

    class AngryChefException(SensitiveArtistException)
     |  Exception that indicates the chef is unhappy.
:
See the Python documentation at www.python.org/doc/2.4/dist/dist.html for more on writing distutils setup scripts.
Summary
This chapter pulls together concepts from the earlier chapters to show, by example, how to create modules. If you follow the techniques described in this chapter, your modules will fit in with other modules and follow the important Python conventions.
A module is simply a Python source file that you choose to treat as a module. Simple as that sounds, you need to follow a few conventions when creating a module:
Document the module and all classes, methods, and functions in the module.
Test the module and include at least one test function.
Define which items in the module to export — which classes, functions, and so on.
Create any exception classes you need for the issues that can arise when using the module.
Handle the situation in which the module itself is executed as a Python script.
Inside your modules, you’ll likely define classes, which Python makes exceedingly easy.
While developing your module, you can use the help and reload functions to display documentation on your module (or any other module for that matter) and reload the changed module, respectively.
After you have created a module, you can create a distributable bundle of the module using the distutils. To do this, you need to create a setup.py script.
Chapter 11 describes regular expressions, an important concept used for finding relevant information in a sea of data.
Exercises
1. How can you get access to the functionality provided by a module?
2. How can you control which items from your modules are considered public? (Public items are available to other Python scripts.)
3. How can you view documentation on a module?
4. How can you find out what modules are installed on a system?
5. What kind of Python commands can you place in a module?
11
Text Processing
There is a whole range of applications for which scripting languages like Python are perfectly suited; and in fact scripting languages were arguably invented specifically for these applications, which involve the simple search and processing of various files in the directory tree. Taken together, these applications are often called text processing. Python is a great scripting tool for both writing quick text processing scripts and then scaling them up into more generally useful code later, using its clean object-oriented coding style. This chapter will show you the following:
Some of the typical reasons you need text processing scripts
A few simple scripts for quick system administration tasks
How to navigate around in the directory structure in a platform-independent way, so your scripts will work fine on Linux, Windows, or even the Mac
How to create regular expressions to compare the files found by the os and os.path modules.
How to use successive refinement to keep enhancing your Python scripts to winnow through the data found.
Text processing scripts are one of the most useful tools in the toolbox of anybody who seriously works with computer systems, and Python is a great way to do text processing. You’re going to like this chapter.
Why Text Processing Is So Useful
In general, the whole idea behind text processing is simply finding things. There are, of course, situations in which data is organized in a structured way; these are called databases and that’s not what this chapter is about. Databases carefully index and store data in such a way that if you know what you’re looking for, you can retrieve it quickly. However, in some data sources, the information is not at all orderly and neat, such as directory structures with hundreds or thousands of files, or logs of events from system processes consisting of thousands or hundreds of thousands of lines, or even e-mail archives with months of exchanges between people.
When data of that nature needs to be searched for something, or processed in some way, then text processing is in its element. Of course, there’s no reason not to combine text processing with other data-access methods; you might find yourself writing scripts rather often that run through thousands of lines of log output and do occasional RDBMS (relational database management system) lookups on some of the data they run across. You’ll learn about these in Chapter 14. This is a natural way to work.
Ultimately, this kind of script can very often get used for years as part of a back-end data processing system. If the script is written in a language like Perl, it can sometimes be quite opaque when some poor soul is assigned five years later to “fix it.” Fortunately, this is a book about Python programming, and so the scripts written here can easily be turned into reusable object classes — later, you’ll look at an illustrative example.
The two main tools in your text processing belt are directory navigation, and an arcane technology called regular expressions. Directory navigation is one area in which different operating systems can really wreak havoc on simple programs, because the three major operating system families (Unix, Windows, and the Mac) all organize their directories differently; and, most painfully, they use different characters to separate subdirectory names. Python is ready for this, though — a series of cross-platform tools are available for the manipulation of directories and paths that, when used consistently, can eliminate this hassle entirely. You saw these in Chapter 8, and you’ll see more uses of these tools here.
A regular expression is a way of specifying a very simple text parser, which then can be applied relatively inexpensively (which means that it will be fast) to any number of lines of text. Regular expressions crop up in a lot of places, and you’ve likely seen them before. If this is your first exposure to them, however, you’ll be pretty pleased with what they can do. In the scope of this chapter, you’re just going to scratch the surface of full-scale regular expression power, but even this will give your scripts a lot of functionality.
You’ll first look at some of the reasons you might want to write text processing scripts, and then you’ll do some experimentation with your new knowledge. The most common reasons to use regular expressions include the following:
Searching for files
Extracting useful data from program logs, such as a web server log
Searching through your e-mail
The following sections introduce these uses.
Searching for Files
Searching for files, or doing something with some files, is a mainstay of text processing. For example, suppose that you spent a few months ripping your entire CD collection to MP3 files, without really paying attention to how you were organizing the hundreds of files you were tossing into some arbitrarily made-up set of directories. This wouldn’t be a problem if you didn’t wait a couple of months before thinking about organizing your files into directories according to artist — and only then realized that the directory structure you ended up with was hopelessly confused.
Text processing to the rescue! Write a Python script that scans the hopelessly nonsensical directory structure and then divides each filename into parts that might be an artist’s name. Then take that potential name and try to look it up in a music database. The result is that you could rearrange hundreds of files into directories by, if not the name of the artist, certainly some pretty good guesses that will get you close to having a sensible structure. From there, you would be able to explore manually and end up actually having an organized music library.
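The filename-splitting step might look something like the sketch below. It assumes, purely for illustration, that the parts of a filename are separated by dashes and that underscores stand in for spaces; the function name candidate_names and the sample filename are made up. The database lookup is left out.

```python
import os.path
import re


def candidate_names(filename):
    """Split a music filename into parts that might be an artist's name."""
    stem, _extension = os.path.splitext(os.path.basename(filename))
    # Treat underscores as spaces, then split on dashes.
    parts = re.split(r"\s*-\s*", stem.replace("_", " "))
    # Drop empty pieces and bare track numbers such as "03".
    return [p.strip() for p in parts if p.strip() and not p.strip().isdigit()]


print(candidate_names("03 - Some_Artist - Some Song.mp3"))
```

A rule this simple will also split names that legitimately contain a dash, which is one reason the results are only guesses to be checked by hand.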
This is a one-time use of a text processing script, but you can easily imagine other scenarios in which you might use a similarly useful script on a regular basis, as when you are handling data from a client or from a data source that you don’t control. Of course, if you need to do this kind of sorting often, you can easily use Python to come up with some organized tool classes that perform these tasks to avoid having to duplicate your effort each time.
Whenever you face a task like this, a task that requires a lot of manual work manipulating data on your computer, think Python. Writing a script or two could save you hours and hours of tedious work.
A second but similar situation results as a fallout of today’s large hard disks. Many users store files willy-nilly on their hard disk, but never seem to have the time to organize their files. A worse situation occurs when you face a hard disk full of files and you need to extract some information you know is there on your computer, but you’re not sure where exactly. You are not alone. Apple, Google, Microsoft and others are all working on desktop search techniques that help you search through the data in the files you have collected to help you to extract useful information.
Think of Python as a desktop search on steroids, because you can create scripts with a much finer control over the search, as well as perform operations on the files found.
Clipping Logs
Another common text-processing task that comes up in system administration is the need to sift through log files for various information. Scripts that filter logs can be spur-of-the-moment affairs meant to answer specific questions (such as “When did that e-mail get sent?” or “When was the last time my program logged one specific message?”), or they might be permanent parts of a data processing system that evolves over time to manage ongoing tasks. These could be a part of a system administration and performance-monitoring system, for instance. Scripts that regularly filter logs for particular subsets of the information are often said to be clipping logs. The idea is that, just as you clip polygons to fit on the screen, you can also clip logs to fit into whatever view of the system you need.
However you decide to use them, after you gain some basic familiarity with the techniques used, these scripts become almost second nature. This is an application where regular expressions are used a lot, for two reasons: First, it’s very common to use a Unix shell command like grep to do first-level log clipping; second, if you do it in Python, you’ll probably be using regular expressions to split the line into usable fields before doing more work with it. In any one clipping task, you may very well be using both techniques.
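As a small taste of the Python side of this, the sketch below uses one regular expression to split each log line into fields and keeps only the lines at a given level. The line layout (date, time, level, message) and the sample lines are assumptions made up for this example; real logs will need a pattern of their own.

```python
import re

# Assumed line layout: "YYYY-MM-DD HH:MM:SS LEVEL message"
LINE = re.compile(r"^(\S+) (\S+) (\w+) (.*)$")


def clip(lines, level):
    """Yield (date, time, message) for each line logged at the given level."""
    for line in lines:
        match = LINE.match(line.rstrip("\n"))
        # Lines that do not fit the layout are skipped silently.
        if match and match.group(3) == level:
            yield match.group(1), match.group(2), match.group(4)


sample = [
    "2005-03-01 09:15:02 INFO mail queued\n",
    "2005-03-01 09:15:07 ERROR disk nearly full\n",
    "garbage\n",
]
print(list(clip(sample, "ERROR")))
```

Because clip is a generator, it can be handed an open file object and will work through a large log one line at a time.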
After a short introduction to traversing the file system and creating regular expressions, you’ll look at a couple of scripts for text processing in the following sections.
Sifting through Mail
The final text processing task is one that you’ve probably found useful (or if you haven’t, you’ve badly wanted it): the processing of mailbox files to find something that can’t be found by your normal Inbox search feature. The most common reason you need something more powerful for this is that the mailbox file is either archived, so that you can access the file, but not read it with your mail reader easily, or it has been saved on a server where you’ve got no working mail client installed. Rather than go through the hassle of moving it into your Inbox tree and treating it like an active folder, you might find it simpler just to write a script to scan it for whatever you need.
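Such a scanning script can be quite short. The sketch below assumes the archive is in the common mbox format and uses the standard mailbox module to read it; the helper name find_subjects, the file name, and the two sample messages are invented so that the example is self-contained.

```python
import mailbox
import os
import tempfile


def find_subjects(mbox_path, word):
    """Return the subjects of messages whose subject contains word (ignoring case)."""
    box = mailbox.mbox(mbox_path)
    try:
        return [message["subject"] for message in box
                if word.lower() in (message["subject"] or "").lower()]
    finally:
        box.close()


# Build a small stand-in mailbox so the sketch can run on its own.
path = os.path.join(tempfile.mkdtemp(), "archive.mbox")
box = mailbox.mbox(path)
box.add("From: a@example.com\nSubject: Lunch menu\n\nSoup today.\n")
box.add("From: b@example.com\nSubject: Server outage\n\nBack at noon.\n")
box.flush()
box.close()

print(find_subjects(path, "lunch"))
```

Searching message bodies, dates, or senders follows the same pattern, with a different test inside the loop.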
However, you can also easily imagine a situation in which your search script might want to get data from an outside source, such as a web page or perhaps some other data source, like a database (see Chapter 14 for more about databases), to cross-reference your data, or do some other task during the search that can’t be done with a plain vanilla mail client. In that case, text processing combined with any other technique can be an incredibly useful way to find information that may not be easy to find any other way.
Navigating the File System with the os Module
The os module and its submodule os.path are among the most helpful things about using Python for the many day-to-day tasks that you have to perform on a lot of different systems. If you often need to write scripts and programs on either Windows or Unix that would still work on the other operating system, you know from Chapter 8 that Python takes care of much of the work of hiding the differences between how things work on Windows and Unix.
In this chapter, we’re going to completely ignore a lot of what the os module can do (ranging from process control to getting system information) and just focus on some of the functions useful for working with files and directories. Some things you’ve been introduced to already, while others are new.
One of the difficult and annoying points about writing cross-platform scripts is the fact that directory names are separated by backslashes (\) under Windows, but forward slashes (/) under Unix. Even breaking a full path down into its components is irritatingly complicated if you want your code to work under both operating systems.
Furthermore, Python, like many other programming languages, makes special use of the backslash character to indicate special text, such as \n for a newline. This complicates your scripts that create file paths on Windows.
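The short sketch below shows the pitfall and two standard ways around it. The path itself is a made-up example.

```python
# "\n" and "\t" are read as a newline and a tab, silently corrupting the path.
plain = "C:\new\table.txt"

# Doubling each backslash, or using a raw string, keeps the backslashes literal.
escaped = "C:\\new\\table.txt"
raw = r"C:\new\table.txt"

print(len(plain), len(raw))
print(escaped == raw)
```

The corrupted string is two characters shorter than the intended one, because each escape sequence collapses into a single character.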
With Python’s os.path module, however, you get some handy functions that will split and join path names for you automatically with the right characters, and they’ll work correctly on any OS that Python is running on (including the Mac). You can call a single function to iterate through the directory structure and call another function of your choosing on each file it finds in the hierarchy. You’ll be seeing a lot of that function in the examples that follow, but first let’s look at an overview of some of the useful functions in the os and os.path modules that you’ll be using.
Function Name, as Called | Description
os.getcwd() | Returns the current directory. You can think of this function as the basic coordinate of directory functions in whatever language.
os.listdir(directory) | Returns a list of the names of files and subdirectories stored in the named directory. You can then run os.stat() on the individual entries, for example, to determine which are files and which are subdirectories.
os.stat(path) | Returns a tuple of numbers, which give you everything you could possibly need to know about a file (or directory). These numbers are taken from the structure filled in by the C function of the same name, and they have the following meanings (some are dummy values under Windows, but they’re in the same places!): st_mode: file type and permissions; st_ino: inode number (Unix); st_dev: device number; st_nlink: number of hard links (Unix); st_uid: userid of owner; st_gid: groupid of owner; st_size: size of the file in bytes; st_atime: time of last access; st_mtime: time of last modification; st_ctime: time of creation (Windows) or of last status change (Unix).
os.path.split(path) | Splits the path, appropriately for the current operating system, into two parts: everything up to the last separator, and the final name. Returns a tuple, not a list. This always surprises me.
os.path.join(components) | Joins name components into a path appropriate to the current operating system.
os.path.normcase(path) | Normalizes the case of a path. Under Unix, this has no effect because filenames are case-sensitive; but under Windows, where the OS will silently ignore case when comparing filenames, it’s useful to run normcase on a path before comparing it to another path so that if one has capital letters, but the other doesn’t, Python will be able to compare the two the same way that the operating system would. That is, they’d be the same regardless of capitalization in the path names, as long as that’s the only difference. Under Windows, the function returns a path in all lowercase and converts any forward slashes into backslashes.
os.path.walk(start, function, arg) | This is a brilliant function that iterates down through a directory tree starting at start. For each directory, it calls the function function like this: function(arg, dir, files), where arg is any arbitrary argument (usually something that is modified, like a dictionary), dir is the name of the current directory, and files is a list containing the names of all the files and subdirectories in that directory. If you modify the files list in place by removing some subdirectories, you can prevent os.path.walk() from iterating into those subdirectories.
There are more functions where those came from, but these are the ones used in the example code that follows. You will likely use these functions far more than any others in these modules. Many other useful functions can be found in the Python module documentation for os and os.path.
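One caution for readers on newer interpreters: os.path.walk() exists only in Python 2 and was removed in Python 3. The os.walk() generator, available since Python 2.3, covers the same ground. The sketch below combines it with os.stat() and os.path.join() to collect file sizes while skipping chosen subdirectories. The helper name file_sizes and the little directory tree it builds are made up for the demonstration.

```python
import os
import stat
import tempfile


def file_sizes(start, skip=()):
    """Map each regular file under start to its size, skipping named subdirectories."""
    sizes = {}
    for directory, subdirs, files in os.walk(start):
        # Editing subdirs in place stops os.walk from descending into them.
        subdirs[:] = [d for d in subdirs if d not in skip]
        for name in files:
            path = os.path.join(directory, name)
            info = os.stat(path)
            if stat.S_ISREG(info.st_mode):
                sizes[path] = info.st_size
    return sizes


# Stand-in directory tree for the demonstration.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "keep"))
os.makedirs(os.path.join(root, "cache"))
for relative, text in [("keep/a.txt", "hello"), ("cache/b.txt", "ignored")]:
    with open(os.path.join(root, *relative.split("/")), "w") as handle:
        handle.write(text)

sizes = file_sizes(root, skip=("cache",))
print(sorted(os.path.relpath(p, root) for p in sizes))
```

Pruning works the same way as described for os.path.walk() in the table: the list of subdirectories has to be changed in place, which is why the sketch assigns to subdirs[:] rather than to subdirs.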
Try It Out
Listing Files and Playing with Paths
The best way to get to know functions in Python is to try them out in the interpreter. Try some of the preceding functions to see what the responses will look like.
1. From the Python interpreter, import the os and os.path modules:
>>> import os, os.path
2. First, see where you are in the file system. This example is being done under Windows, so your mileage will vary:
>>> os.getcwd()
'C:\\Documents and Settings\\michael'
3. If you want to do something with this programmatically, you’ll probably want to break it down into the directory path, as a tuple (use join to put the pieces back together):
>>> os.path.split(os.getcwd())
('C:\\Documents and Settings', 'michael')
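Putting the pieces back together with join can be sketched as a script that behaves the same on any platform. The path components here are placeholders chosen to echo the session above.

```python
import os.path

# Build a path from components; the separator is chosen for the current OS.
path = os.path.join("Documents and Settings", "michael", "notes.txt")

# split() returns a (head, tail) tuple: everything before the last
# separator, and the final component.
head, tail = os.path.split(path)

# join() reverses the split.
rebuilt = os.path.join(head, tail)

print(tail)
print(rebuilt == path)
```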