Assembly Language Step by Step Programming with DOS and Linux 2nd Ed 2000.pdf

Using Type Specifiers

Back on the sample reference appendix page (see page 212), notice the following example uses of the NEG instruction:

NEG BYTE [BX] ; Negates byte quantity at DS:BX

NEG WORD [DI] ; Negates word quantity at DS:DI

Why BYTE [BX]? Or WORD [DI]? Used in this way, BYTE and WORD are what we call type specifiers, and you literally can't use NEG (or numerous other machine instructions) on memory data without one or the other. They are not instructions in the same sense that NEG is an instruction. They exist in the broad class of things we call directives. Directives give instructions to the assembler. In this case, they tell the assembler how large the operand is when there is no other way for the assembler to know.

The problem is this: The NEG instruction negates its operand. The operand can be either a byte or a word; in real mode, NEG works equally well on both. But ... how does NEG know whether to negate a byte or a word? The memory data operand [BX] only specifies an address in memory, using DS as the assumed segment register. The address DS:BX points to a byte, but it also points to a word, which is nothing more than two bytes in a row somewhere in memory. So, does NEG negate the byte located at address DS:BX? Or does it negate the two bytes (a word) that start at address DS:BX?

Unless you tell it somehow, NEG has no way to know.

Telling an instruction the size of its operand is what BYTE and WORD do. Several other instructions that work on single operands only (such as INC, DEC, and NOT) have the same problem and use type specifiers to resolve this ambiguity.
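As a minimal sketch of the single-operand instructions just mentioned, the small .COM program below gives every memory operand an explicit size. The variable name Scratch and its initial value are illustrative assumptions, not part of the book's listings.

```nasm
[BITS 16]
[ORG 0100H]

[SECTION .text]
START:
        mov bx, Scratch     ; BX now holds the offset of Scratch
        neg byte [bx]       ; Negate only the byte at DS:BX
        neg word [bx]       ; Negate the two-byte word starting at DS:BX
        inc byte [bx]       ; Add 1 to the byte at DS:BX
        dec word [bx]       ; Subtract 1 from the word at DS:BX
        not byte [bx]       ; Invert every bit of the byte at DS:BX

        mov ax, 04C00H      ; DOS Terminate Process, ERRORLEVEL 0
        int 21H

[SECTION .data]
Scratch dw 0001H            ; Hypothetical two-byte variable to operate on
```

Dropping the specifier, as in NEG [BX], leaves NASM with no size to work from, and it rejects the line with an "operation size not specified" error.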

Types in Assembly Language

Unlike nearly all high-level languages such as Pascal and C++, the notion of type in assembly language is almost wholly a question of size. A word is a type, as is a byte, a double word, a quad word, and so on. The assembler is unconcerned with what an assembly language variable means. (Keeping track of such things is totally up to you.) The assembler only worries about how big it is. The assembler does not want to have to try to fit 10 pounds of kitty litter in a 5-pound bag, which is impossible, nor 5 pounds of kitty litter in a 10-pound bag, which can be confusing and under some circumstances possibly dangerous.

Register data always has a fixed and obvious type, since a register's size cannot be changed. BL is one byte and BX is two bytes.

The type of immediate data depends on the magnitude of the immediate value. If the immediate value is too large to fit in a single byte, that immediate value becomes word data and you can't load it into an 8-bit register half. An immediate value that can fit in a single byte may be loaded into either a byte-sized register half or a full word-sized register; its type is thus taken from the context of the instruction in which it exists and matches that of the register data operand into which it is to be loaded. But if you try to load immediate data into a destination that's too small for it, the assembler will complain. Here's a trivial example:

MOV BL,0FFFFH

When it encounters this, NASM will complain by saying, "Warning: Byte value exceeds bounds." BL can hold values from 0 to 0FFH (0 to 255). The value 0FFFFH is out of bounds because it is much larger than 0FFH.
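A short fragment, offered only as a sketch, shows how the same immediate value takes its type from the destination register. The particular registers and values are chosen for illustration.

```nasm
[BITS 16]
        mov bl, 0FFH        ; Fits in a byte: BL = 0FFH (255)
        mov bx, 0FFH        ; Same value, assembled as word data: BX = 00FFH
        mov bx, 0FFFFH      ; Needs a full word, so only a 16-bit register will do
```

All three lines assemble cleanly; only a value too large for its destination draws the complaint.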

Memory data is something else again. We've spoken of memory data so far in terms of registers holding offsets without considering the use of named memory data. I discuss named memory data in the next chapter, but in brief terms, you can define named variables in your assembly language programs using such directives as DB and DW. It looks like this:

Counter DB 0

MixTag  DW 32

Here, Counter is a variable allocated as a single byte in memory by the DB (Define Byte) directive. Similarly, MixTag is a variable allocated as a word in memory by the DW (Define Word) directive.

By using DB, you give variable Counter a type and hence a size. You must match this type when you use the variable name Counter in an instruction to indicate memory data. The way to do this is to use the BYTE directive, as I mentioned a little earlier. This, for example, will be accepted by the assembler:

MOV BL,BYTE [Counter]

This instruction will take the current value located in memory at the address represented by the variable name Counter and will load that value into register half BL. You might wonder: Why do I need to put the BYTE directive there? The assembler should know that Counter is 1 byte in size because it was defined using the directive DB.

In some assemblers, including Microsoft's MASM, it would. However, NASM's authors feel that it's important to be as explicit with assemblers as possible and leave little or nothing for the assembler to infer from context. So, although NASM uses the DB directive to allocate one byte of memory for the variable Counter, it does not remember that Counter takes up only one byte when you insert Counter as an operand in a machine instruction. The size has to come from your source code: either from a register operand, as BL supplies here, or from an explicit BYTE directive, which becomes mandatory when no register fixes the size, as in MOV BYTE [Counter],0. Writing the directive out every time will force you to think a little bit more about what you're doing at every point that you do it; that is, right where you use variable names as instruction operands. Doing so may help you avoid certain really stupid mistakes, like the ones I used to make all the time while I was working with MASM, most of which came out of trying to let the assembler do my thinking for me.

To me, this is a wonderful thing, and one of the main reasons I chose NASM as the focus of this book.

Now here's another case, one that NASM will assemble without a burp:

MOV BL,BYTE [MixTag]

This looks innocent enough until you remember that MixTag is actually 2 bytes (one word) in size, having been defined with the DW directive. You might think this is an error, because MixTag isn't the same size as BL. True enough, but the key is that there's no ambiguity here. The assembler knows what you want, even if what you want is peculiar. The type specifier BYTE forces the assembler to look upon MixTag as being 1 byte in size. MixTag is not byte-sized, however, so what actually happens is that the least significant (lowermost) byte of MixTag will be loaded into BL, with the most significant byte left high and dry.

Is this useful? It can be. Is it dangerous? You bet. It is up to you to decide whether overriding the type of memory data makes sense, and it is completely your responsibility to ensure that doing so doesn't sprinkle your code with bugs. But nothing is left for the assembler to decide. That's what type specifiers are for: to make it clear to the assembler in every case what it is supposed to do. Whether that in fact makes sense is up to you. Use your head, and know what you're doing. That's more important in assembly language than anywhere else in computer programming.
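To see which byte a BYTE override actually picks up, here is a minimal sketch. It assumes MixTag is given the value 1234H (instead of 32) so that its two bytes are easy to tell apart. The x86 stores the low byte of a word first in memory.

```nasm
[BITS 16]
[ORG 0100H]

[SECTION .text]
START:
        mov bl, byte [MixTag]   ; BL = 34H, the low byte (stored first in memory)
        mov bh, byte [MixTag+1] ; BH = 12H, the high byte (stored second)
        mov byte [MixTag], 0    ; No register operand here, so BYTE is mandatory
        mov ax, 04C00H          ; DOS Terminate Process, ERRORLEVEL 0
        int 21H

[SECTION .data]
MixTag  dw 1234H                ; Hypothetical value chosen so the two bytes differ
```

The third instruction is the case where the specifier cannot be left out: with an immediate source and a memory destination, nothing else tells the assembler how many bytes to write.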

Chapter 8: Our Object All Sublime

Creating Programs that Work

Overview

They don't call it "assembly" for nothing. Facing the task of writing an assembly language program brings to mind images of Christmas morning: You've spilled 1,567 small metal parts out of a large box marked Land Shark HyperBike (Some Assembly Required) and now you have to somehow put them all together with nothing left over. (In the meantime, the kids seem more than happy playing in the box…)

I've actually explained just about all you absolutely must understand to create your first assembly language program. Still, there is a nontrivial leap from here to there; you are faced with many small parts with sharp edges that can fit together in an infinity of different ways, most wrong, some workable, but only a few that are ideal.

So here's the plan: In the following section I will present you with the completed and operable Land Shark HyperBike, which I will then tear apart before your eyes. This is the best way to learn to assemble: By pulling apart programs written by those who know what they're doing. Over the rest of this book we'll pull a few more programs apart, in the hope that by the time it's over you'll be able to move in the other direction all by yourself.

The Bones of an Assembly Language Program

The following listing is perhaps the simplest correct program that will do anything visible and still be comprehensible and expandable. This issue of comprehensibility is utterly central to quality assembly language programming. With no other computer language (not even APL or that old devil FORTH) is there anything even close to the risk of writing code that looks so much like something scraped off the wall of King Tut's tomb.

The program EAT.ASM displays one (short) line of text on your display screen:

Eat at Joe's!

For that you have to feed 28 lines of text file to the assembler. Many of those 28 lines are unnecessary in the strict sense, but serve instead as commentary to allow you to understand what the program is doing (or more important, how it's doing it) six months or a year from now.

One of the aims of assembly language coding is to use as few instructions as possible in getting the job done. This does not mean creating as short a source code file as possible. (The size of the source file has nothing to do with the size of the executable file assembled from it!) The more comments you put in your file, the better you'll remember how things work inside the program the next time you pick it up. I think you'll find it amazing how quickly the logic of a complicated assembly language file goes cold in your head. After no more than 48 hours of working on other projects, I've come back to assembler projects and had to struggle to get back to flank speed on development.

Comments are neither time nor space wasted. IBM used to say, "One line of comments per line of code." That's good, and should be considered a minimum for assembly language work. A better course (that I will in fact follow in the more complicated examples later on) is to use one short line of commentary to the right of each line of code, along with a comment block at the start of each sequence of instructions that work together in accomplishing some discrete task.

Here's the program. Read it carefully:

; Source name     : EAT.ASM
; Executable name : EAT.COM
; Code model      : Real mode flat model
; Version         : 1.0
; Created date    : 6/4/1999
; Last update     : 9/10/1999
; Author          : Jeff Duntemann
; Description     : A simple example of a DOS .COM file programmed using
;                   NASM-IDE 1.1 and NASM 0.98.

[BITS 16]               ; Set 16 bit code generation
[ORG 0100H]             ; Set code start address to 100h (COM file)

[SECTION .text]         ; Section containing code

START:
        mov dx, eatmsg  ; Mem data ref without [] loads the ADDRESS!
        mov ah,9        ; Function 9 displays text to standard output.
        int 21H         ; INT 21H makes the call into DOS.

        mov ax, 04C00H  ; This DOS function exits the program
        int 21H         ; and returns control to DOS.

[SECTION .data]         ; Section containing initialized data

eatmsg  db "Eat at Joe's!", 13, 10, "$" ;Here's our message

The Simplicity of Flat Model

After all our discussion in previous chapters about segments, this program might seem, um,…suspiciously simple. And indeed it's simple, and it's simple almost entirely because it's written for the 16-bit real mode flat model. (I drew this model out in Figure 6.8.) The first thing you'll notice is that there are no references to segments or segment registers anywhere. The reason for this is that in real mode flat model, you are inside a single segment, and everything you do, you do within that single segment. If everything happens within one single segment, the segments (in a sense) "factor out" and you can imagine that they don't exist. Once we assemble EAT.ASM and create a runnable program from it, I'll show you what those segment registers are up to and how it is that you can almost ignore them in real mode flat model.

But first, let's talk about what all those lines are doing.

At the top is a summary comment block. This text is for your use only. When NASM processes a .ASM file, it strips out and discards all text between any semicolon and the end of the line the semicolon is in. Such lines are comments, and they serve only to explain what's going on in your program. They add nothing to the executable file, and they don't pass information to the assembler. I recommend placing a summary comment block like this at the top of every source code file you create. Fill it with information that will help someone else understand the file you've written or that will help you understand the file later on, after it's gone cold in your mind.

Beneath the comment block is a short sequence of commands directed to the assembler. These commands are placed in square brackets so that NASM knows that they are for its use, and are not to be interpreted as part of the program.

The first of these commands is this:

[BITS 16] ; Set 16 bit code generation

The BITS command tells NASM that the program it's assembling is intended to be run in real mode, which is a 16-bit mode. Using [BITS 32] instead would have brought into play all the marvelous 32-bit protected mode goodies introduced with the 386 and later x86 CPUs. On the other hand, DOS can't run protected mode programs, so that wouldn't be especially useful.

The next command requires a little more explanation:

[ORG 0100h] ; Set code start address to 100h (COM file)

"ORG" is an abbreviation of origin, and what it specifies is sometimes called the origin of the program, which is where code execution begins. Code execution begins at 0100H for this program. The 0100h value (the h and H are interchangeable) is loaded into the instruction pointer IP by DOS when the program is loaded and run. So, when DOS turns control over to your program (scary thought, that!), the first instruction to be executed is the one pointed to by IP-in this case, at 0100H.

Why 0100H? Look back at Figure 6.8. The real mode flat model (which is often called the .COM file model) has a 256-byte prefix at the beginning of its single segment. This is the Program Segment Prefix (PSP) and it has several uses that I won't be explaining here. The PSP is basically a data buffer and contains no code. The code cannot begin until after the PSP, so the 0100H value is there to tell NASM that DOS will load the code just past those first 256 bytes, and that every address NASM calculates must be offset accordingly.

The next command is this:

[SECTION .text] ; Section containing code

NASM divides your programs into what it calls sections. These sections are less important in real mode flat model than in real mode segmented model, when sections map onto segments. (More on this later.) In flat model, you have only one segment. But the SECTION commands tell NASM where to look for particular types of things. In the .text section, NASM expects to find program code. A little further down the file you'll see another SECTION command, this one for the .data section. In the .data section, NASM expects to find the definitions for your initialized variables. A third section is possible, the .bss section, which contains uninitialized data. EAT.ASM does not use any uninitialized data, so this section does not exist in this program. I discuss uninitialized data later on, in connection with the stack.
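For orientation, here is a minimal sketch of a .COM program that uses all three sections. The names Greeting and Buffer are hypothetical; RESB is NASM's directive for reserving uninitialized bytes.

```nasm
[BITS 16]
[ORG 0100H]

[SECTION .text]             ; Program code
START:
        mov al, [Greeting]  ; Read initialized data: AL = "H"
        mov [Buffer], al    ; Write into uninitialized storage
        mov ax, 04C00H      ; DOS Terminate Process, ERRORLEVEL 0
        int 21H

[SECTION .data]             ; Initialized data: values stored in the file
Greeting db "Hi$"

[SECTION .bss]              ; Uninitialized data: space reserved, no value stored
Buffer  resb 16             ; Reserve 16 bytes
```

The .bss section adds nothing to the size of the .COM file; it only tells NASM where the reserved space will sit in memory once the program is loaded.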

Labels

The next item in the file is something called a label:

START:

A label is a sort of bookmark, holding a place in the program code and giving it a name that's easier to remember than a memory address. The START: label indicates where the program begins. Technically speaking, the START: label isn't necessary in EAT.ASM. You could eliminate the START: label and the program would still assemble and run. However, I think that every program should have a START: label as a matter of discipline. That's why EAT.ASM has one.

Labels are used to indicate where JMP instructions should jump to, and I explain that in detail later in this chapter and in later chapters. The only distinguishing characteristic of labels is that they're followed by colons. Some rules govern what constitutes a valid label:

Labels must begin with a letter or with an underscore, period, or question mark. These last three have special meanings (especially the period), so I recommend sticking with letters until you're way further along in your study of assembly language and NASM.

Labels must be followed by a colon when they are defined. This is basically what tells NASM that the identifier being defined is a label. NASM will punt if no colon is there and will not flag an error, but the colon nails it, and prevents a misspelled mnemonic from being mistaken for a label. So use the colon!

Labels are case sensitive. So yikes:, Yikes:, and YIKES: are three completely different labels. This differs from practice in a lot of languages (Pascal particularly) so keep it in mind.

Later on, we'll see such labels used as the targets of jump instructions. For example, the following machine instruction transfers the flow of instruction execution to the location marked by the label GoHome:

JMP GoHome

Notice that the colon is not used here. The colon is only placed where the label is defined, not where it is referenced. Think of it this way: Use the colon when you are marking a location, not when you are going there.
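The following sketch puts a label definition and a reference to it side by side. The message Unseen and the instructions that the jump skips are illustrative assumptions.

```nasm
[BITS 16]
[ORG 0100H]

[SECTION .text]
START:
        jmp GoHome          ; Reference: no colon
        mov ah, 9           ; Skipped: execution never reaches
        mov dx, Unseen      ; these three lines
        int 21H
GoHome:                     ; Definition: the colon marks the location
        mov ax, 04C00H      ; DOS Terminate Process, ERRORLEVEL 0
        int 21H

[SECTION .data]
Unseen  db "You will not see this.", 13, 10, "$"
```

Run as written, the program displays nothing, because the jump carries execution straight past the display call to the exit code at GoHome.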

Variables for Initialized Data

The identifier eatmsg defines a variable. Specifically, eatmsg is a string variable (more on which follows) but still, as with all variables, it's one of a class of items we call initialized data: something that comes with a value, and not just a box that will accept a value at some future time. A variable is defined by associating an identifier with a data definition directive. Data definition directives look like this:

MyByte   DB 07H         ; 8 bits in size

MyWord   DW 0FFFFH      ; 16 bits in size

MyDouble DD 0B8000000H  ; 32 bits in size

Think of the DB directive as "Define Byte." DB sets aside one byte of memory for data storage. Think of the DW directive as "Define Word." DW sets aside one word of memory for data storage. Think of the DD directive as "Define Double." DD sets aside a double word in memory for storage, typically for full 32-bit addresses.

I find it useful to put some recognizable value in a variable whenever I can, even if the value is to be replaced during the program's run. It helps to be able to spot a variable in a DEBUG dump of memory rather than to have to find it by dead reckoning, that is, by spotting the closest known location to the variable in question and counting bytes to determine where it is.

String Variables

String variables are an interesting special case. A string is just that: a sequence or string of characters, all in a row in memory. A string is defined in EAT.ASM:

eatmsg DB "Eat at Joe's!", 13, 10, "$" ;Here's our message

Strings are a slight exception to the rule that a data definition directive sets aside a particular quantity of memory. The DB directive ordinarily sets aside one byte only. However, a string may be any length you like, as long as it remains on a single line of your source code file. Because there is no data directive that sets aside 16 bytes, or 42, strings are defined simply by associating a label with the place where the string starts. The eatmsg label and its DB directive specify one byte in memory as the string's starting point. The number of characters in the string is what tells the assembler how many bytes of storage to set aside for that string.

Either single quote (') or double quote (") characters may be used to delineate a string, and the choice is up to you, unless you're defining a string value that itself contains one or more quote characters. Notice in EAT.ASM the string variable eatmsg contains a single-quote character used as an apostrophe. Because the string contains a single-quote character, you must delineate it with double quotes. The reverse is also true: If you define a string that contains one or more double-quote characters, you must delineate it with single-quote characters:

Yukkh DB 'He said, "How disgusting!" and threw up.',"$"

You may combine several separate substrings into a single string variable by separating the substrings with commas. Both eatmsg and Yukkh do this. Both add a dollar sign ($) in quotes to the end of the main string data. The dollar sign is used to mark the end of the string for the mechanism that displays the string to the screen. More on that mechanism and marking string lengths in a later section.

What, then, of the "13,10" in eatmsg? This is the carriage return and linefeed pair I discussed in an earlier chapter. Inherited from the ancient world of electromechanical Teletype machines, these two characters are recognized by DOS as meaning the end of a line of text that is output to the screen. If anything further is output to the screen, it will begin at the left margin of the next line below. You can concatenate such individual numbers within a string, but you must remember that they will not appear as numbers. A string is a string of characters. A number appended to a string will be interpreted by most operating system routines as an ASCII character. The correspondence between numbers and ASCII characters is shown in Appendix D.
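As a small sketch of a number being treated as a character, the program below mixes a bare 65 into a string. The name GradeMsg and the message text are assumptions made up for this example; 65 is the ASCII code for the capital letter A.

```nasm
[BITS 16]
[ORG 0100H]

[SECTION .text]
START:
        mov dx, GradeMsg    ; Address of the string, as DOS service 09H expects
        mov ah, 9           ; Display String
        int 21H
        mov ax, 04C00H      ; DOS Terminate Process, ERRORLEVEL 0
        int 21H

[SECTION .data]
GradeMsg db "Grade: ", 65, 13, 10, "$"   ; 65 is the ASCII code for "A"
```

On the screen this appears as Grade: A followed by a move to the start of the next line. The digits 6 and 5 never show up, and neither do 13 and 10.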

Directives versus Instruction Mnemonics

Data definition directives look a little like machine instruction mnemonics, but they are emphatically not machine instructions. One very common mistake made by beginners is looking for the binary opcode represented by a directive such as DB or DW. There is no binary opcode for DW, DB, and the other directives. Machine instructions, as the name implies, are instructions to the CPU itself. Directives, by contrast, are instructions to the assembler.

Understanding directives is easier when you understand the nature of the assembler's job. (Look back to Chapter 4 for a detailed refresher if you've gotten fuzzy on what assemblers and linkers do.) The assembler scans your source code text file, and as it scans your source code file it builds an object code file on disk. It builds this object code file step by step, one byte at a time, starting at the beginning of the file and working its way through to the end. When it encounters a machine instruction mnemonic, it figures out what binary opcode is represented by that mnemonic and writes that binary opcode (which may be anywhere from one to six actual bytes) to the object code file.

When the assembler encounters a directive such as DW, it does not write any opcode to the object code file. DW is a kind of signpost to the assembler, reading "Set aside two bytes of memory right here, for the value that follows." The DW directive specifies an initial value for the variable, and so the assembler writes the bytes corresponding to that value in the two bytes it set aside. The assembler writes the address of the allocated space into a table, beside the label that names the variable. Then the assembler moves on, to the next directive (if there are further directives) or to whatever comes next in the source code file.

For example, when you write the following statement in your assembly language program:

MyVidOrg DW 0B800H

what you are really doing is instructing the assembler to set aside two bytes of data (Define Word, remember) and place the value 0B800H in those two bytes. The assembler writes the identifier MyVidOrg and the variable's address into a table it builds of identifiers (both labels and variables) in the program for later use by other elements of the program, or the linker.
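To make "no opcode, only data" concrete, here is a data-only sketch with the bytes each directive contributes noted in the comments. Multi-byte values go into the file low byte first.

```nasm
[BITS 16]
MyByte   db 07H             ; Emits one byte:   07
MyVidOrg dw 0B800H          ; Emits two bytes:  00 B8
MyDouble dd 0B8000000H      ; Emits four bytes: 00 00 00 B8
```

One way to check this, assuming the fragment is saved under a hypothetical name such as DATA.ASM, is to ask NASM for a listing file with its -l option (nasm -f bin DATA.ASM -l DATA.LST). The listing shows the generated bytes beside each source line.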

The Difference between a Variable's Address and Its Contents

I've left discussion of EAT.ASM's machine instructions for last, at least in part because they're easy to explain. All that EAT.ASM does, really, is hand a string to DOS and tell DOS to display it on the screen by sending it to something called standard output. It does this by passing the address of the string to DOS, not the character values contained in the string itself. This is a crucial distinction that trips up a lot of beginners. Here's the first instruction in EAT.ASM:

mov dx, eatmsg ; Mem data ref without [] loads the ADDRESS!

If you look at the program, you can see that while DX is 2 bytes in size, the string eatmsg is 16 bytes in size (13 characters, the carriage return and linefeed, and the closing dollar sign). At first glance, this MOV instruction would seem impossible, but that's because what's being moved is not the string itself, but the string's address, which (in the real mode flat model) is 16 bits, or 2 bytes, in size. The address will thus fit nicely in DX.

When you place a variable's identifier in a MOV instruction without square brackets, you are accessing the variable's address, as explained previously. By contrast, if you want to work with the value stored in that variable, you must place the variable's identifier in square brackets. Suppose you had defined a variable in the .data section called MyData this way:

MyData DW 0744H

The identifier MyData represents some address in memory, and at that address the assembler places the value 0744H. Now, if you want to copy the value contained in MyData to the AX register, you would use the following MOV instruction:

MOV AX,[MyData]

After this instruction, AX would contain 0744H.
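The two forms can be put side by side in a minimal sketch. The use of BX and CX here is an illustrative choice.

```nasm
[BITS 16]
[ORG 0100H]

[SECTION .text]
START:
        mov bx, MyData      ; No brackets: BX = address (offset) of MyData
        mov ax, [MyData]    ; Brackets: AX = 0744H, the contents of MyData
        mov cx, [bx]        ; CX = 0744H too: same memory, reached through BX
        mov ax, 04C00H      ; DOS Terminate Process, ERRORLEVEL 0
        int 21H

[SECTION .data]
MyData  dw 0744H
```

The third instruction shows why the address is worth having: once it is in a register, the register itself can go inside the brackets.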

There are many situations in which you need to move the address of a variable into a register rather than the contents of the variable. In fact, you may find yourself moving the addresses of variables around more than the contents of the variables, especially if you make a lot of calls to DOS and BIOS services.

If you've used higher-level languages such as Basic and Pascal, this distinction may seem inane. After all, who would mistake the contents of a variable for its location? Well, that's easy for you to say: in Basic and Pascal you rarely if ever even think about where a variable is. The language handles all that rigmarole for you. In assembly language, knowing where a variable is located is essential in order to do lots of important things.

Making DOS Calls

What EAT.ASM really does, as I mentioned previously, is call DOS and instruct DOS to display a string located at a particular place in memory. The string itself doesn't go anywhere; EAT.ASM tells DOS where the string is located, and then DOS reaches up into your .data section and does what it must with the string data.

Calling DOS is done with something called a software interrupt. I explain these in detail later in this chapter. But if you look at the code you can get a sense for what's going on:

        mov dx, eatmsg  ; Mem data ref without [] loads the ADDRESS!
        mov ah,9        ; Function 9 displays text to standard output.
        int 21H         ; INT 21H makes the call into DOS.

Here, the first line loads the address of the string into register DX. The second line simply loads the constant value 9 into register AH. The third line makes the interrupt call, to interrupt 21H.

The DOS call has certain requirements that must be set up before the call is made. It must know what particular call you want to make, and each call has a number. This number must be placed in AH and, in this case, is call 09H (Display String). For this particular DOS call, DOS expects the address of the string to be displayed to be in register DX. If you satisfy those two conditions, you can make the DOS software interrupt call INT 21H, and there's your string on the screen!

Exiting the Program and Setting ERRORLEVEL

Finally, the job is done, Joe's has been properly advertised, and it's time to let DOS have the machine back. Another DOS service, 4CH (Terminate Process), handles the mechanics of courteously disentangling the machine from EAT.ASM's clutches. Terminate Process doesn't need the address of anything, but it will take whatever value it finds in the AL register and place it in the ERRORLEVEL DOS variable. DOS batch programs can test the value of ERRORLEVEL and branch on it.

EAT.ASM doesn't do anything worth testing in a batch program, but if ERRORLEVEL will be set anyway, it's a good idea to provide some reliable and harmless value for ERRORLEVEL to take. This is why 0 is loaded into AL prior to ending it all by the final INT 21H instruction: the single instruction MOV AX,04C00H places the service number 4CH in AH and 0 in AL. If you were to test ERRORLEVEL after running EAT.COM, you would find it set to 0 in every case.
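A program that does want to report something to a batch file can load a different value into AL. The sketch below loads the two register halves separately to make the roles plain; the exit code 2 is an arbitrary, hypothetical choice.

```nasm
[BITS 16]
[ORG 0100H]

[SECTION .text]
START:
        mov ah, 04CH        ; DOS service 4CH: Terminate Process
        mov al, 2           ; Hypothetical exit code; becomes ERRORLEVEL 2
        int 21H             ; Hand the machine back to DOS
```

The two MOV instructions could be folded into a single MOV AX,04C02H, which is the same trick EAT.ASM uses with 04C00H.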

That's really all there is to EAT.ASM. Now let's see what it takes to run it, and then let's look more closely at its innards in memory.