
The make Utility and Dependencies

If you've done any programming in C at all, you're almost certainly familiar with the idea of the make utility. The make mechanism grew up in the C world, and although it's been adopted by many other programming languages and environments, it's never been adopted quite as thoroughly (or as nakedly) as in the C world.

What the make mechanism does is build executable program files from their component parts. Like gcc, the make utility is a puppet master that executes other programs according to a master plan, which is a simple text file called a make file. The make file (which by default is named "makefile") is a little like a computer program in that it specifies how something is to be done. But unlike a computer program, it doesn't specify the precise sequence of operations to be taken. What it does is specify what pieces of a program are required to build other pieces of the program, and in doing so ultimately defines what it takes to build the final executable file. It does this by specifying certain rules called dependencies.

Dependencies

Throughout this book we've been looking at teeny little programs with a hundred lines of code or less. In the real world, useful programs can take thousands, tens of thousands, or even millions of lines of source code. (The current release of Linux represents about 10 million lines of source code, depending on how you define what's a "part" of Linux. At last realizing that program bugs increase at least linearly with the size of a program's source code suite, Microsoft has stopped bragging about how many lines of code it took to create Windows NT. In truth, I'm not sure I want to know.) Managing such an immense quantity of source code is the central problem in software engineering. Making programs modular is the oldest and most-used method of dealing with program complexity. Cutting up a large program into smaller chunks and working on the chunks separately helps a great deal. In ambitious programs, some of the chunks are further cut into even smaller chunks, and sometimes the various chunks are written in more than one programming language. Of course, that creates the additional challenge of knowing how the chunks are created and how they all fit together. For that you really need a blueprint.

A make file is such a blueprint.

In a modular program, each chunk of code is created somehow, generally by using a compiler or an assembler and a linker. Compilers, assemblers, and linkers take one or more files and create new files from them. An assembler, as you've learned, takes a .asm file full of assembly language source code and uses it to create a linkable object code file or (in some cases) an executable program file. You can't create the object code file without having and working with the source code file. The object code file depends on the source code file for its very existence.

Similarly, a linker connects multiple object code files into a single executable file. The executable file depends on the existence of the object code files for its existence. The contents of a make file specify which files are necessary to create which other files, and what steps are necessary to accomplish that creation. The make utility looks at the rules (called dependencies) in the make file and invokes whatever compilers, assemblers, and other utilities it sees are necessary to build the final executable or library file.

There are numerous flavors of make utilities, and not all make files are comprehensible to all make utilities everywhere. The Unix make utility is pretty standard, however, and the one that comes with Linux is the one we'll be discussing here.

Let's take an example that actually makes a simple Linux assembly program. Typically, in creating a make file, you begin by determining which file or files are necessary to create the executable program file. The executable file is created in the link step, so the first dependency you have to define is which files the linker requires to create the executable file. As I explained earlier in this chapter, under Linux the link step is handled for us by the GNU C compiler, gcc. (Turn back to Figure 12.1 and the associated discussion if it's still fuzzy as to why a C compiler is required to link an assembly program.) The dependency itself can be pretty simply stated:

eatlinux: eatlinux.o

All this says is that to generate the executable file eatlinux, we first need to have the file eatlinux.o. The line is actually a dependency line written as it should be for inclusion in a make file. In any but the smallest programs (such as this one) the linker will have to link more than one .o file. So this is probably the simplest possible sort of dependency: One executable file depends on one object code file. If there are additional files that must be linked to generate the executable file, these are placed in a list, separated by spaces:

linkbase: linkbase.o linkparse.o linkfile.o

This line tells us that the executable file linkbase depends on three object code files, and all three of these files must exist before we can generate the executable file that we want.

Lines like these tell us what files are required, but not what must be done with them. That's an essential part of the blueprint, and it's handled in a line that follows the dependency line. The two lines work together. Here are both lines for our simple example:

eatlinux: eatlinux.o
	gcc eatlinux.o -o eatlinux

The second line must be indented, and the indentation must be a tab character rather than spaces; make insists on the tab. Otherwise, the two lines should be pretty easy to understand: The first line tells us what file or files are required to do the job. The second line tells us how the job is to be done: in this case, by using gcc to link eatlinux.o into the executable file eatlinux.
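Stated as a general pattern, every rule in a make file takes this form (the names here are placeholders, not real files):

target: file1 file2 ...
	command that builds target from file1, file2, and so on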

Nice and neat: We specify which files are necessary and what has to be done with them. The make mechanism, however, has one more very important aspect: knowing whether the job as a whole actually has to be done at all.

When a File Is Up to Date

It may seem idiotic to have to come out and say so, but once a file has been compiled or linked, it's been done, and it doesn't have to be done again . . . until we modify one of the required source or object code files. The make utility knows this. It can tell when a compile or a link task needs to be done at all, and if the job doesn't have to be done, make will refuse to do it.

How does make know if the job needs doing? Consider this dependency:

eatlinux: eatlinux.o

Make looks at this and understands that the executable file eatlinux depends on the object code file eatlinux.o, and that you can't generate eatlinux without having eatlinux.o. It also knows when both files were last changed, and if the executable file eatlinux is newer than eatlinux.o, it deduces that any changes made to eatlinux.o are already reflected in eatlinux. (It can be absolutely sure of this because the only way to generate eatlinux is by processing eatlinux.o.)

The make utility pays close attention to Linux timestamps. Whenever you edit a source code file, or generate an object code file or an executable file, Linux updates that file's timestamp to the moment that the changes were finally completed. We say that one file is newer than another if the time value in its timestamp is more recent than the other's, and only the timestamps matter: a file you originally created six months ago but saved a moment ago is newer than a file created from scratch only 10 minutes ago.

(In case you're unfamiliar with the notion of a timestamp, it's simply a value that an operating system keeps in a file system directory for every file in the directory. A file's timestamp is updated to the current clock time whenever the file is changed.)

When a file is newer than all of the files that it depends upon (according to the dependencies called out in the make file), that file is said to be up to date. Nothing will be accomplished by generating it again, because all information contained in the component files is reflected in the dependent file.

Chains of Dependencies

So far, this may seem like a lot of fuss to no great purpose. But the real value in the make mechanism begins to appear when a single make file contains chains of dependencies. Even in the simplest make files, there will be dependencies that depend on other dependencies. Our completely trivial example program requires two dependency statements in its make file.

Consider that the following dependency statement specifies how to generate an executable file from an object code (.o) file:

eatlinux: eatlinux.o
	gcc eatlinux.o -o eatlinux

The gist here is that to make eatlinux, you start with eatlinux.o and process it according to the recipe in the second line. Okay, . . . so where does eatlinux.o come from? That requires a second dependency statement:

eatlinux.o: eatlinux.asm
	nasm -f elf eatlinux.asm

Here we explain that to generate eatlinux.o, we need eatlinux.asm . . . and to generate it we follow the recipe in the second line. The full make file would contain nothing more than these two dependencies:

eatlinux: eatlinux.o
	gcc eatlinux.o -o eatlinux

eatlinux.o: eatlinux.asm
	nasm -f elf eatlinux.asm

These two dependency statements define the two steps that we must take to generate an executable program file from our very simple assembly language source code file eatlinux.asm. However, it's not obvious from the two dependencies I show here that all the fuss is worthwhile. Assembling eatlinux.asm pretty much requires that we link eatlinux.o to create eatlinux. The two steps go together in virtually all cases.
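If you were to run make against this make file in a directory containing only eatlinux.asm, you would see something like the following (make echoes each command as it runs it):

nasm -f elf eatlinux.asm
gcc eatlinux.o -o eatlinux

Run make again immediately, and it will do no work at all: both eatlinux.o and eatlinux are already up to date.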

But consider a real-world programming project, in which there are hundreds of separate source code files. Only some of those files might be "on the rack" and undergoing change on any given day. However, to build and test the final program, all of the files are required. But . . . are all the compilation steps and assembly steps required? Not at all.

An executable program is knit together by the linker from one or more (often many more) object code files. If all but (let's say) two of the object code files are up to date, there's no reason to compile the other 147 source code files. You just compile the two source code files that have been changed, and then link all 149 object code files into the executable.

The challenge, of course, is correctly remembering which two files have changed, and being sure that all changes that have been recently made to any of the 149 source code files are reflected in the final executable file. That's a lot of remembering, or referring to notes. And it gets worse when more than one person is working on the project, as will be the case in nearly all commercial software development projects. The make utility makes remembering any of this unnecessary. Make figures it out and does only what must be done, no more, no less.

The make utility looks at the make file, and it looks at the timestamps of all the source code and object code files called out in the make file. If the executable file is newer than all of the object code files, nothing needs to be done. However, if any of the object code files are newer than the executable file, the executable file must be relinked. And if one or more of the source code files are newer than either the executable file or their respective object code files, some compiling must be done before any linking is done.
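To see how such chains are expressed, here is what a complete make file for the hypothetical linkbase program mentioned earlier might look like, assuming (purely for the sake of the example) that all three modules are written in assembly:

linkbase: linkbase.o linkparse.o linkfile.o
	gcc linkbase.o linkparse.o linkfile.o -o linkbase

linkbase.o: linkbase.asm
	nasm -f elf linkbase.asm

linkparse.o: linkparse.asm
	nasm -f elf linkparse.asm

linkfile.o: linkfile.asm
	nasm -f elf linkfile.asm

Change linkparse.asm and run make, and only linkparse.asm will be reassembled before the link step; the other two .o files are left alone.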

What make does is start with the executable file and look for chains of dependency moving away from that. The executable file depends on one or more object files, which depend on one or more source code files, and make walks the path up the various chains, taking note of what's newer than what and what must be done to put it all right. Make then executes the compiler, assembler, and linker selectively to be sure that the executable file is ultimately newer than all of the files that it depends on. Make ensures that all work that needs to be done gets done. Furthermore, make avoids spending unnecessary time compiling and assembling files that are already up to date and do not need to be compiled or assembled. Given that a full build (by which I mean the recompilation and relinking of every single file in the project) can take several hours on an ambitious program, make saves an enormous amount of idle time when all you need to do is test changes made to one small part of the program.

There is actually a lot more to the Unix make facility than this, but what I've described are the fundamental principles. You have the power to make compiling conditional, inclusion of files conditional, and much more. You won't need to fuss with such things on your first forays into assembly language (or C programming, for that matter), but it's good to know that the power is there as your programming skills improve and you take on more ambitious projects.
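As one small taste of that extra power: most make files define variables so that a list of object files need only be written once, and nearly all of them include a "clean" target that deletes everything the make file knows how to regenerate. A sketch, using the hypothetical linkbase files again:

OBJECTS = linkbase.o linkparse.o linkfile.o

linkbase: $(OBJECTS)
	gcc $(OBJECTS) -o linkbase

clean:
	rm -f linkbase $(OBJECTS)

Typing "make clean" forces the next plain "make" to rebuild everything from scratch.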

Using make from within EMACS

The EMACS source code editor has the power to invoke the make facility without forcing you to leave the editor. This means that you can change a source code file in the editor and then compile it without dropping back out to the Linux shell. EMACS has a command called Compile, which is an item in its Tools menu. When you select Tools | Compile, EMACS will place the following command in the command line at the bottom of its window and wait for you to do something:

compile command: make -k

You can add additional text to this command line, you can backspace over it and delete parts of it (like the -k option), or you can press Enter and execute the command as EMACS wrote it. In most cases (especially while you're just getting started) all you need to do is press Enter.

Here's what happens: EMACS invokes the make utility. Unless you typed another name for the make file, make assumes that the make file will be called "makefile." The -k option instructs make to keep going after an error: the target whose build failed (and anything that depends on it) is abandoned, with the previous copy of that target file left undisturbed, but any other targets that can still be built are built. If this doesn't make sense to you right now, don't worry; it's a good idea to use -k until you're really sure you don't need to. EMACS places it on the command line automatically, and you have to explicitly backspace over it to make it go away.

When it invokes make, EMACS opens a new text buffer and pipes all text output from the make process into that buffer. It will typically split your EMACS window so that the make buffer window is below the buffer you were in when you selected Tools | Compile. This allows you to see the progress of the make operation (including any error or warning messages) without leaving EMACS.

Of course, if make determines that the executable file is up to date, it will do nothing beyond displaying a message to that effect:

make: 'eatlinux' is up to date.

If you're using EMACS in an X Window window (which is what will happen automatically if you have X Window running when you invoke EMACS), you can switch from window to window by clicking with the mouse on the window you want to work in. This way you can click your way right back to the window in which you're editing source code.

One advantage to having make pipe its output into an EMACS buffer is that you can save the buffer to a text file for later reference. To do this, just keep the cursor in the make window, select the Files | Save Buffer As command, and then give the new buffer file a name.

Understanding AT&T Instruction Mnemonics

I've alluded a time or two in this book to the fact that there is more than one set of mnemonics for the x86 instruction set. There is only one set of machine instructions, but the machine instructions are pure binary bit patterns that were never intended for human consumption. A mnemonic is just that: a way for human beings to remember what the binary bit pattern 1000100111000011 means to the CPU. Instead of writing 16 ones and zeros in a row (or even the slightly more graspable hexadecimal equivalent $89 $C3), we say

MOV BX,AX.

Keep in mind that mnemonics are just that—memory joggers for humans—and are creatures unknown to the CPU itself. Assemblers translate mnemonics to machine instructions. Although we can agree among ourselves that MOV BX,AX will translate to 1000100111000011, there's nothing magical about the string MOV BX,AX. We could as well have agreed on "COPY AX TO BX" or "STICK GPREGA INTO GPREGB." We use MOV BX,AX because that was what Intel suggested we do, and since it designed and manufactures the CPU chips, we feel that it has no small privilege in such matters.

There is another set of mnemonics for the x86 processors, and, as luck would have it, those mnemonics predominate in the Linux world. They didn't come about out of cussedness or contrariness, but because the people who originally created Unix also wished to create a family of nearly portable assemblers to help implement Unix on new platforms. I say "nearly portable" because a truly portable assembler is impossible. (Supposedly, the C language originated as an attempt to create a genuinely portable assembler notation—which, of course, is the definition of a higher-level language.) What they did do was create a set of global conventions that all assemblers within the Unix family would adhere to, and thus make creating a CPU-specific assembler faster and less trouble. These conventions actually predate the creation of the x86 processors themselves.

When gcc compiles a C source code file to machine code, what it really does is translate the C source code to assembly language source code, using what most people call the AT&T mnemonics. (Unix was created at AT&T in the sixties, and the assembler conventions for Unix assemblers were defined there as well.) Look back to Figure 12.1. The gcc compiler takes as input a .c source code file, and outputs a .s assembly source file, which is then handed to the GNU assembler gas for assembly. This is the way the GNU tools work on all platforms. In a sense, assembly language is an intermediate language used mostly for the C compiler's benefit. In most cases, programmers never see it and don't have to fool with it.

In most cases. However, if you're going to deal with the GNU debugger gdb at a machine-code level (rather than at the C source code level), the AT&T mnemonics will be in your face at every single step of the way, heh-heh. In my view the usefulness of gdb is greatly reduced by its strict dependence on the AT&T instruction mnemonics. I keep looking for somebody to create a DEBUG-style debugger for Linux that uses Intel's own mnemonics, but so far I've come up empty.

Therefore, it would make sense to become at least passingly familiar with the AT&T mnemonic set. There are some general rules that, once digested, make it much easier. Here's the list in short form:

AT&T mnemonics and register names are invariably in lowercase. This is in keeping with the Unix convention of case sensitivity, and at complete variance with the Intel convention of uppercase for assembly language source. I've mixed uppercase and lowercase in the text and examples to get you used to seeing assembly source both ways, but you have to remember that while Intel (and hence NASM) suggests uppercase but will accept lowercase, AT&T requires lowercase.

Register names are always preceded by the percent symbol, %. That is, what Intel would write as AX or EBX, AT&T would write as %ax and %ebx. This helps the assembler recognize register names.

Every AT&T machine instruction mnemonic that has operands has a single-character suffix indicating how large its operands are. The suffix letters are b, w, and l, indicating byte (8 bits), word (16 bits), or long (32 bits). What Intel would write as MOV BX,AX, AT&T would write as movw %ax,%bx. (The changed order of %ax and %bx is not an error. See the next rule!)

In the AT&T syntax, source and destination operands are placed in the opposite order from Intel syntax. That is, what Intel would write as MOV BX,AX, AT&T would write as movw %ax,%bx. In other words, in AT&T syntax, the source operand comes first, followed by the destination. This actually makes a little more sense than Intel conventions, but confusion and errors are inevitable.

In the AT&T syntax, immediate operands are always preceded by the dollar sign, $. What Intel would write as PUSH DWORD 32, AT&T would write as pushl $32. This helps the assembler recognize immediate operands.

AT&T documentation refers to "sections" where we would say "segments." A segment override is thus a section override in AT&T parlance. This doesn't come into play much because segments are not a big issue in 32-bit flat model programming. Still, be aware of it.

Not all Intel instruction mnemonics have AT&T equivalents. JCXZ, JECXZ, LOOP, LOOPZ, LOOPE, LOOPNZ, and LOOPNE do not exist in the AT&T mnemonic set, and gcc never generates code that uses them. This won't be a problem for us, as we're using NASM, but you won't see these instructions in gdb displays.

In the AT&T syntax, displacements in memory references are signed quantities placed outside parentheses containing the base, index, and scale values. I'll treat this one separately a little later, as you'll see it a lot in .s files and you should be able to understand it.

There are a handful of other issues that would be involved in programs more complex than we'll take up in this book. These mostly involve near versus far calls and jumps and their associated return instructions.
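Here's a short set of equivalents that pulls most of these rules together in one place; the AT&T form is on the left, with the Intel/NASM form as a comment. (These particular instructions are examples of my own, not gcc output.)

movw %ax,%bx        # mov bx,ax
movl %ebx,%eax      # mov eax,ebx
movb $0,%al         # mov al,0
pushl $32           # push dword 32
addl $4,%esp        # add esp,4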

Examining gas Source Files Created by gcc

The best way to get a sense for the AT&T assembly syntax is to look at an actual AT&T-style .s file produced by gcc. Doing this actually has two benefits: First of all, it will help you become familiar with the AT&T mnemonics and formatting conventions. In addition, you may find it useful, when struggling to figure out how to call a C library function from assembly, to create a short C program that calls the function of interest and then examine the .s file that gcc produces when it compiles your C program. The dateis.c program that follows was part of my early research, and I used it to get a sense for how ctime() was called at the assembly level. Obviously, for this trick to work you must have at least a journeyman understanding of the AT&T mnemonics. (I discuss ctime() and other time-related C library calls in detail in the next chapter.)

You don't automatically get a .s file every time you compile a C program. The .s file is created, but once gas assembles the .s file to a binary object code file (typically a .o file), it deletes the .s file. If you want to examine a .s file created by gcc, you must compile with the -S option. (Note that this is an uppercase S. Case matters big time in the Unix world!) The command would look like this:

gcc dateis.c -S

Note that the output of this command is the assembly source file only. If you specify the -S option, gcc understands that you want to generate assembly source rather than an executable program file, so all it will generate is the .s file. To compile a C program to an executable program file, you must compile it again without the -S option.
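In other words, building the dateis program described below is a two-command affair if you also want to keep the assembly source around:

gcc dateis.c -S
gcc dateis.c -o dateis

The first command produces dateis.s and stops; the second compiles and links all the way to the executable file dateis.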

Here's dateis.c. It does nothing more than print out the date and time as returned by the standard C library function ctime():

#include <time.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    time_t timeval;

    (void)time(&timeval);
    printf("The date is: %s", ctime(&timeval));
    exit(0);
}

It's not much of a program, but it does illustrate the use of three C library function calls, time(), ctime(), and printf(). When gcc compiles the preceding program (dateis.c), it produces the file dateis.s, which follows. I have manually added the equivalent Intel mnemonics as comments to the right of the AT&T mnemonics, so you can see what equals what in the two systems. (Alas, neither gcc nor any other utility I have ever seen will do this for you!)

	.file	"dateis.c"
	.version	"01.01"
gcc2_compiled.:
	.section	.rodata
.LC0:
	.string	"The date is: %s"
	.text
	.align 4
	.globl main
	.type	main,@function
main:
	pushl %ebp              # push ebp
	movl %esp,%ebp          # mov ebp,esp
	subl $4,%esp            # sub esp,4
	leal -4(%ebp),%eax      # lea eax,[ebp-4]
	pushl %eax              # push eax
	call time               # call time
	addl $4,%esp            # add esp,4
	leal -4(%ebp),%eax      # lea eax,[ebp-4]
	pushl %eax              # push eax
	call ctime              # call ctime
	addl $4,%esp            # add esp,4
	movl %eax,%eax          # mov eax,eax
	pushl %eax              # push eax
	pushl $.LC0             # push dword .LC0
	call printf             # call printf
	addl $8,%esp            # add esp,8
	pushl $0                # push dword 0
	call exit               # call exit
	addl $4,%esp            # add esp,4
	.p2align 4,,7
.L1:
	leave                   # leave
	ret                     # ret
.Lfe1:
	.size	main,.Lfe1-main
	.ident	"GCC: (GNU) egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)"

One thing to keep in mind when reading this is that dateis.s is assembly language code produced mechanically by a compiler, and not by a human programmer! Some things about the code (such as why the label .L1 is present but never referenced) are less than ideal and can only be explained as artifacts of gcc's compilation machinery. In a more complex program there may be some customary use of a label .L1 that doesn't exist in a program this simple.

Some quick things to note here while reading the preceding listing:

When an instruction does not take operands (call, leave, ret), it does not have an operand-size suffix. Calls and returns look pretty much alike in both Intel and AT&T syntax.

When referenced, the name of a message string is prefixed by a dollar sign ($) the same way that numeric literals are. In NASM, a named string variable is considered a variable and not a literal. This is just another AT&T peccadillo to be aware of.

Note that the comment delimiter in the AT&T scheme is the pound sign (#) rather than the semicolon used in nearly all Intel-style assemblers, including NASM.

AT&T Memory Reference Syntax

As you'll remember from earlier chapters, referencing the contents of a memory location (as distinct from referencing its address) is done by enclosing the address in square brackets, like so:

mov eax, dword [ebp]

Here, we're taking whatever 32-bit quantity is located at the address contained in EBP and loading it into register EAX. The x86 processors allow a number of different ways of specifying the address. To a core address called a base we can add another register called an index, and to that a constant value called a displacement. We used this sort of addressing to locate a string within a table of strings back in Chapter 11. Such addressing modes can look like this:

mov eax, dword [ebx-4]      ; Base minus displacement
mov al, byte [bx+di+28]     ; Base plus index plus displacement

I haven't really covered this, but you can add an additional factor to the index called a scale, which is a power of two by which you multiply the index:

mov al, byte [ebx+edi*4]

The scale can't be any arbitrary value, but must be one of 2, 4, or 8. (The value 1 is legal but doesn't accomplish anything useful.) This mode, called scaled indexed addressing, is only available in 32-bit flat model and will not work in 16-bit modes at all—which is why I haven't mentioned it in this book before now.
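Scaled indexed addressing is handy for walking tables of 32-bit values. As a quick sketch (the register assignments here are arbitrary), assume EBX holds the address of a table of dwords and ECX holds an entry number:

mov eax, dword [ebx+ecx*4]   ; Load entry number ECX from the table

Because each entry is 4 bytes wide, scaling ECX by 4 converts an entry number into a byte offset without any separate multiply instruction.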

All of the examples I've shown you so far use the Intel syntax. The AT&T syntax for memory addressing is considerably different. In place of square brackets, AT&T uses parentheses to enclose the components of a memory address:

movb (%ebx),%al      # mov al, byte [ebx] in Intel syntax

Here, we're moving the byte quantity at [ebx] to AL. (Don't forget that the order of operands is reversed from what Intel syntax does!) Inside the parentheses you place the base, the index, and the scale, when present. (The base must always be there.) The displacement, when one exists, must go in front of and outside the parentheses:

movl -4(%ebx),%eax         # mov eax, dword [ebx-4] in Intel syntax
movb 28(%ebx,%edi),%al     # mov al, byte [ebx+edi+28] in Intel syntax

Note that in AT&T syntax, you don't do the math inside the parentheses. The base, index, and scale are separated by commas, and plus signs and asterisks are not allowed. The schema for interpreting an AT&T memory reference is as follows:

±disp(base,index,scale)

The ± symbol indicates that the displacement is signed; that is, it may be either positive or negative, to indicate whether the displacement value is added to or subtracted from the rest of the address. Typically, you only see the sign as explicitly negative; without the minus symbol, the assumption is that the displacement is positive. The displacement value is optional. You may omit it entirely if there's no displacement in the memory reference. Similarly, you may omit the scale if there is no scale value present in the effective address.
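A reference that uses every component in the schema might look like this (an example of my own, not taken from gcc output):

movl -12(%ebp,%ecx,4),%eax   # mov eax, dword [ebp+ecx*4-12] in Intel syntax

Here the base is EBP, the index is ECX, the scale is 4, and the displacement is -12.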

What you will see most of the time, however, is a very simple type of memory reference:

-16(%ebp)

The displacements will vary, of course, but what this almost always means is that an instruction is referencing a data item somewhere on the stack. C code allocates its variables on the stack, in a stack frame, and then references those variables by constant offsets from the value in EBP. EBP acts as a "thumb in the stack," and items on the stack may be referenced in terms of offsets (either positive or negative) away from EBP. The preceding reference would tell a machine instruction to work with an item at the address in EBP minus 16 bytes.
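You can see a small-scale instance of this pattern in the dateis.s listing earlier in this chapter. A slightly expanded sketch (the 16-byte figure is arbitrary, not anything gcc produced) would look like this, with the Intel equivalents as comments:

pushl %ebp           # push ebp              save the caller's EBP
movl %esp,%ebp       # mov ebp,esp           EBP now marks this frame
subl $16,%esp        # sub esp,16            make room for 16 bytes of locals
movl $0,-16(%ebp)    # mov dword [ebp-16],0  zero one of the locals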

I have a lot more to say about stack frames in the next chapter.