
CAS727, S. Qiao

Principles of Floating Point Computation

Before investigating numerical software engineering, we must first study the principles of finite precision computation.

1 Floating-point Number System

A floating point number system F ⊂ R is a subset of the real numbers whose elements have the form

    y = ±m × β^(e−t).

The mantissa m is an integer satisfying 0 ≤ m ≤ β^t − 1. To ensure a unique representation for each y ∈ F, it is assumed that m ≥ β^(t−1) if y ≠ 0, so that the system is normalized. Note that an alternative way of expressing y is y = ±β^e × .d1d2...dt, where each digit di satisfies 0 ≤ di ≤ β − 1. It is important to realize that the floating point numbers are not equally spaced. The system F is characterized by four integer parameters: the base β (also called the radix), the precision t, and the exponent range emin ≤ e ≤ emax. Another measure of the precision is the machine precision, denoted by ε_M. It is defined as the distance between 1.0 and the next larger floating point number. Clearly, ε_M = β^(1−t).
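As a small illustration (not part of the original notes), the following C sketch checks this definition for IEEE double precision, where β = 2 and t = 53; nextafter and DBL_EPSILON are standard C library facilities.

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void) {
        /* Machine precision: the gap between 1.0 and the next larger
           double. For IEEE double, beta = 2 and t = 53, so the gap is
           beta^(1-t) = 2^(-52). Link with -lm. */
        double eps = nextafter(1.0, 2.0) - 1.0;
        printf("gap after 1.0: %.17g\n", eps);
        printf("DBL_EPSILON:   %.17g\n", DBL_EPSILON);     /* the same value */
        printf("2^(-52):       %.17g\n", ldexp(1.0, -52)); /* ditto */
        return 0;
    }

All three lines print the same number, 2.2204460492503131e-16.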

Example 1 When β = 2, t = 3, emin = −2, and emax = 3, there are 24 normalized positive floating point numbers:

    1.00 × 2^−2, 1.01 × 2^−2, 1.10 × 2^−2, 1.11 × 2^−2,
    1.00 × 2^−1, · · ·, 1.10 × 2^3, 1.11 × 2^3
Figure 1 shows some of the positive floating-point numbers in our small system. Note that they are not equally spaced.

2 Underflow and Overflow

Overflow means that the exponent is too large to be represented in the exponent field. Underflow means that the exponent is too small to be represented in the exponent field. In our small floating point system, the largest normal number Nmax is 1.11 × 2^3 = 14, which is called the overflow threshold. When a result is larger than Nmax, overflow occurs. The smallest positive normal number Nmin is 1.00 × 2^−2 = 1/4, which is called the underflow threshold. When a nonzero result is smaller than Nmin, underflow occurs. We use Example 1 above to illustrate overflow and underflow.


Figure 1: A small floating-point number system.

Example 2 Using the system in Example 1, when a = 1.00 × 2^1 = 2 and b = 1.10 × 2^3 = 12, then a/b = 1/6 causes underflow and a × b = 24 causes overflow.
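For a concrete analogue of Example 2 on real hardware, here is a minimal C sketch (assuming IEEE double arithmetic) that pushes past both thresholds:

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        double nmax = DBL_MAX;   /* overflow threshold Nmax for double */
        double nmin = DBL_MIN;   /* underflow threshold Nmin for double */
        printf("Nmax * 2 = %g\n", nmax * 2.0);  /* overflows to inf */
        printf("Nmin / 2 = %g\n", nmin / 2.0);  /* underflows gradually
                                                   to a denormal */
        return 0;
    }

As Section 3.2 explains, the overflow produces ∞ and the underflow produces a denormalized number rather than zero.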

3 IEEE Standard

The IEEE standard 754 [5], published in 1985, defines a binary floating point arithmetic system. Among the design principles of the standard are:

- encourage software developers to develop robust, efficient, and portable numerical programs,

- enable the handling of arithmetic exceptions,

- provide for the development of transcendental functions and very high precision arithmetic.

For recent status on the IEEE standard see [6]. This section provides a tour of the IEEE standard. Each subsection discusses one aspect of the standard.

Before going into the IEEE floating-point number system, we use the small number system in Example 1 to illustrate floating-point number representation.

In Example 1, all normalized numbers have a leading bit 1, so we can save this bit; it is called the hidden bit. Thus we need two bits for the fraction, three bits for the exponent, and one bit for the sign, a total of six bits. We define our small floating-point number system as follows.

Definition 1 A Small Floating-Point Number System

    t                         3
    emax                      +3
    emin                      −2
    Exponent width in bits    3
    Format width in bits      6

This small floating-point number system can represent 24 positive numbers and 24 negative numbers.


3.1 Formats

Base (Radix)

In IEEE standard 754, the base is 2.

Precision

The IEEE standard defines four different precisions: single, double, single-extended, and double-extended. In 754, single and double precision correspond roughly to what most floating-point hardware provides. Single precision occupies a single 32 bit word, double precision two consecutive 32 bit words. Extended precision is a format that offers at least a little extra precision and exponent range. The IEEE standard only specifies a lower bound on how many extra bits extended precision provides. The minimum allowable double-extended format is sometimes referred to as the 80-bit format, even though Table 1 shows it using 79 bits. The reason is that hardware implementations of extended precision normally do not use a hidden bit, and so would use 80 rather than 79 bits. This feature is available on almost all architectures, but it continues to be underutilized for lack of linguistic support, because the programming language community appears not to understand how and why this format is intended to be used. Long double is intended to support double and single the way double supports single: intermediate results in evaluating an expression are computed in the higher precision. Why? Extra-precise arithmetic attenuates the risk of error due to roundoff.

Example 3 Extra precision as a way to conserve an interest rate's monotonicity [7, p. 64]. An important requirement for certain financial computations of rates of return is monotonicity: if a small change in the data causes a computed result to change, the change should not go in the wrong direction. For example, if the return on an investment is increased, its computed rate of return must not decrease. The conservation of monotonicity becomes more challenging as it becomes more important during computations designed to optimize rates of return. These rates satisfy equations that can have more than one root, and the choice of the right root can be spoiled if monotonicity is lost to roundoff. By far the easiest way to conserve monotonicity is to compute extra precisely.

    Parameter                 Single    Double    Single-Extended    Double-Extended
    t                         24        53        ≥ 32               ≥ 64
    emax                      +127      +1023     ≥ +1023            > +16383
    emin                      −126      −1022     ≤ −1022            ≤ −16382
    Exponent width in bits    8         11        ≥ 11               ≥ 15
    Format width in bits      32        64        ≥ 43               ≥ 79

Table 1: IEEE 754 Format Parameters

Exponent

Since the exponent can be positive or negative, some method must be chosen to represent its sign. Two common methods of representing signed numbers are sign/magnitude and two's complement. The IEEE binary standard does not use either of these methods to represent the exponent, but instead uses a biased representation for efficient calculation of exponents. In the case of single precision, where the exponent is stored in 8 bits, the bias is 127 (for double precision it is 1023). What this means is that if k is the value of the exponent bits interpreted as an unsigned integer (biased), then the exponent of the floating-point number is k − 127 (unbiased).

Example 4 In our small floating-point number system (Definition 1), the bias is 3. Thus, for example, the unsigned integer (binary) 001 in the exponent field (biased) represents exponent −2 (unbiased), which is emin.
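The same decoding can be done by hand in IEEE single precision. The following C sketch (an illustration, not from the notes) extracts the 8 exponent bits of a float and subtracts the bias of 127:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float x = 0.25f;                  /* 1.00 x 2^-2 in binary */
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);   /* view the raw bit pattern */
        uint32_t k = (bits >> 23) & 0xFF; /* the 8 exponent bits, biased */
        printf("biased k = %u, unbiased e = %d\n", k, (int)k - 127);
        /* prints: biased k = 125, unbiased e = -2 */
        return 0;
    }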

3.2 Special Quantities

The IEEE standard specifies the following special values: ±0, NaNs, ±∞, and denormalized numbers. These special values are all encoded with exponents of either emax + 1 or emin − 1.

Signed Zeros

Zero is represented by the exponent emin − 1 and a zero fraction. Since the sign bit can take on two different values, there are two zeros, +0 and −0. The IEEE standard defines comparison so that +0 = −0, rather than −0 < +0. Although it would be possible always to ignore the sign of zero, the IEEE standard does not do so. When a multiplication or division involves a signed zero, the usual sign rules apply in computing the sign of the answer.
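A short C sketch (assuming C99 and IEEE arithmetic) showing how the two zeros compare equal yet still carry a usable sign:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double pz = +0.0, nz = -0.0;
        printf("+0 == -0   : %d\n", pz == nz);         /* 1: they compare equal */
        printf("signbit(-0): %d\n", signbit(nz) != 0); /* 1: sign is preserved */
        /* The sign participates in division, per the usual sign rules: */
        printf("1/+0 = %g, 1/-0 = %g\n", 1.0 / pz, 1.0 / nz);  /* inf, -inf */
        return 0;
    }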

NaNs

 

Traditionally, the computation of 0/0 or √−1 has been treated as an unrecoverable error which causes a computation to halt. However, it makes sense for a computation to continue in such a situation.

    Operation    NaN Produced By
    +            ∞ + (−∞)
    ×            0 × ∞
    /            0/0, ∞/∞
    REM          x REM 0, ∞ REM y
    sqrt         sqrt(x) when x < 0

Table 2: Operations that Produce a NaN

Example 5 Suppose there is a statement return (−b + SQRT(d))/(2 × a) in a program. If d < 0, then SQRT(d) is a NaN. In the IEEE standard, if a NaN is involved in a floating point computation, the result is another NaN. So the return value of (−b + SQRT(d))/(2 × a) is NaN, which indicates that it is not a real number. The computation continues.

In IEEE 754, NaNs are often represented as floating point numbers with exponent emax + 1 and nonzero fractions. Implementations are free to put system-dependent information into the fraction. Thus there is not a unique NaN, but rather a whole family of NaNs. In general, whenever a NaN participates in a floating point operation, the result is another NaN.
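The following C example (assuming a standard C99 environment) shows a NaN being created by an invalid operation and then propagating; it also shows the peculiar comparison behavior of NaNs, which matters again in Section 3.5:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double d = -1.0;
        double r = sqrt(d);                    /* invalid operation: NaN */
        printf("sqrt(-1.0) = %g\n", r);
        printf("NaN + 1    = %g\n", r + 1.0);  /* a NaN propagates */
        printf("r == r     : %d\n", r == r);   /* 0: NaN is unequal to itself */
        printf("isnan(r)   : %d\n", isnan(r) != 0);
        return 0;
    }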

Infinities

The infinity symbol is represented by a zero fraction and the same exponent field as a NaN (emax + 1); the sign bit distinguishes between ±∞. The infinity symbol obeys the usual mathematical conventions regarding infinity, such as ∞ + ∞ = ∞, (−1) × ∞ = −∞, and (finite)/∞ = 0.

Just as NaNs provide a way to continue a computation when expressions like 0/0 or √−1 are encountered, infinities provide a way to continue when an overflow occurs. This is much safer than simply returning the largest representable number.

Example 6 Suppose that there is a computation of √(x²) under our small floating-point number system (Definition 1). If x = 1.00 × 2^3, then x² overflows. If x² is replaced by the largest number 1.11 × 2^3, then the result of √(x²) is about 1.00 × 2^2 rather than 1.00 × 2^3. Obviously it is far from our expected result. Adopting the IEEE standard, the result of x² would be ∞. So the final result of this computation would be ∞. It is safer than just returning an ordinary floating point number which is far from the correct answer.
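A rough C analogue of Example 6 in IEEE double precision (the value 1e200 is just an arbitrarily chosen large input):

    #include <stdio.h>
    #include <math.h>
    #include <float.h>

    int main(void) {
        /* sqrt(x*x) should return x. With IEEE semantics, x*x overflows
           to +inf and sqrt(inf) = inf, which plainly signals the overflow. */
        double x = 1e200;
        printf("IEEE:    %g\n", sqrt(x * x));   /* inf */
        /* Clamping the overflowed product to the largest finite number
           instead would yield a plausible-looking but wildly wrong value. */
        printf("clamped: %g\n", sqrt(DBL_MAX)); /* ~1.34e154, not 1e200 */
        return 0;
    }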


Figure 2: Denormalized numbers (those between 0 and 2^−2, marked by ×).

Denormalized Numbers

The IEEE standard uses denormalized numbers, which guarantee x = y x − y = 0, as well as other useful relations. Specifically, if the bits in the fraction field are b1, b2, ..., bp−1, and the value of the exponent is e, then when e > emin −1, the number being represented is 1.b1b2...bp−1 ×2e whereas when e = emin − 1, the number being represented is 0.b1b2... × 2e+1. The +1 in the exponent is needed because denormals have an exponent of emin, not

emin − 1.

Example 7 In our small floating-point system (Definition 1), if the normalized numbers x = 1.01 × 2^−2 and y = 1.00 × 2^−2, then x − y is too small to be represented in the normalized number range and must be flushed to zero (underflow), although x ≠ y. When denormalized numbers are introduced, x − y does not underflow; instead x − y = 0.01 × 2^−2. This behavior is called gradual underflow (sometimes called graceful underflow). Figure 2 plots the positive denormalized numbers in our small floating point number system.
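A minimal C sketch of Example 7 translated to IEEE double precision, using DBL_MIN as the double-precision Nmin:

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        /* x and y are distinct normal numbers whose difference is below
           Nmin. With gradual underflow the difference is a denormal,
           not zero. */
        double x = 1.25 * DBL_MIN;   /* 1.01 x 2^emin */
        double y = 1.00 * DBL_MIN;   /* 1.00 x 2^emin */
        double d = x - y;            /* 0.01 x 2^emin: a denormal */
        printf("x - y = %g, nonzero: %d\n", d, d != 0.0);
        return 0;
    }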

When denormalized numbers are added to the number line, the spacing between adjacent floating point numbers varies in a regular way: adjacent spacings are either the same length or differ by a factor of β. Without denormals, the spacing abruptly changes from β^(−t+1) × β^emin to β^emin, which is a factor of β^(t−1), rather than the orderly change by a factor of β. Because of this, many algorithms that can have large relative error for normalized numbers close to the underflow threshold are well-behaved in this range when gradual underflow is used.

Table 3 summarizes the IEEE values.

    Exponent             Fraction    Represents
    e = emin − 1         f = 0       ±0
    e = emin − 1         f ≠ 0       0.f × 2^emin
    emin ≤ e ≤ emax      any f       1.f × 2^e
    e = emax + 1         f = 0       ±∞
    e = emax + 1         f ≠ 0       NaN

Table 3: IEEE 754 values

3.3 Correctly Rounded Operations

When we apply a floating point operation, say add, to floating point numbers, the exact result may not fit the format of the destination. For example, consider in our small system c = a + b, where a = 1.00 × 2^0 and b = 1.01 × 2^−2. The exact result is 1.0101 in binary, which does not fit the format of the destination c, a floating point number in our system. We must round the exact result to fit the format of our system.


For example, the floating point number that is nearest to 1.0101 is 1.01 × 2^0.

The IEEE standard requires that the following floating point operations are correctly rounded:

- the arithmetic operations +, −, ×, and /

- square root and remainder

- conversions of formats

Correctly rounded means that the result must be the same as if it were computed exactly and then rounded, usually to the nearest floating-point number. We will discuss the rounding modes in the next section.

The format conversions include:

- between floating point formats, e.g., from single to double,

- between floating point and integer formats,

- from floating point to integer value (not format),

- between binary and decimal.

3.4 Rounding Modes

In the IEEE standard, rounding occurs whenever an operation has a result that is not exact. By default, rounding means round toward nearest. The standard requires that three other rounding modes be provided, namely round toward 0, round toward +∞, and round toward −∞.

1. Round to +∞: always round up to the first representable floating point number.

2. Round to −∞: always round down to the first representable floating point number.

3. Round to 0: always truncate the digits after the last representable digit.

4. Round to nearest even: always round to the nearest floating point number. In the case of a tie, the one with its least significant bit equal to zero is chosen.

Example 8 Take the two floating point numbers a = b = 1.01 × 2^1 in our small floating-point number system. The exact value of a × b is 1.1001 × 2^2. This value is between 1.10 × 2^2 and 1.11 × 2^2. Table 4 shows the rounded value under the different rounding modes.

    Round           → +∞         → −∞         → 0          → nearest
    1.1001 × 2^2    1.11 × 2^2   1.10 × 2^2   1.10 × 2^2   1.10 × 2^2

Table 4: Four rounding modes

Why dynamic directed rounding modes? Testing numerical stability. How? Different rounding introduces slightly different rounding errors. If slightly perturbed intermediate results cause significant changes in the final results, the program is unstable.
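As an illustration of this stability test, here is a C99 sketch using the standard fenv.h rounding-mode controls; the harmonic-sum loop is just a stand-in workload, and the directed-rounding constants are only defined on platforms that support those modes:

    #include <stdio.h>
    #include <fenv.h>

    #pragma STDC FENV_ACCESS ON

    /* Run the same computation under each rounding mode; if slightly
       different rounding errors cause large changes in the result, the
       computation is suspect. Compile without aggressive FP
       optimizations (e.g., with -frounding-math on GCC). */
    static double compute(void) {
        double s = 0.0;
        for (int i = 1; i <= 1000000; i++)
            s += 1.0 / i;
        return s;
    }

    int main(void) {
        int modes[] = { FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO };
        const char *names[] = { "to nearest", "to +inf", "to -inf", "to 0" };
        for (int m = 0; m < 4; m++) {
            fesetround(modes[m]);
            printf("%-11s %.15f\n", names[m], compute());
        }
        fesetround(FE_TONEAREST);   /* restore the default mode */
        return 0;
    }

A stable computation shows only tiny differences in the last digits across the four runs; large differences are a warning sign.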

Example 9 Computing the area of a needle-like triangle and testing instability using different rounding modes; see [7, p. 58].

Another application of dynamic directed rounding modes is interval arithmetic.

3.5 Exceptions, Flags and Trap Handlers

The IEEE standard divides exceptions into five classes: overflow, underflow, division by zero, invalid operation, and inexact, as shown in Table 5. They are called “exceptions” because to any policy for handling them, imposed in advance upon all programmers by the computer system, some programmers will have good reasons to take exception. IEEE 754 specifies a default policy for each exception, and allows the system implementor the option of offering programmers an alternative policy, which is to trap with specified information about the exception to a programmer-selected trap handler.

    Exception        Result when traps disabled    Argument to trap handler
    overflow         ±∞ or ±xmax                   round(x · 2^−α)
    underflow        0, ±2^emin or denormal        round(x · 2^α)
    divide by zero   ±∞                            operands
    invalid          NaN                           operands
    inexact          round(x)                      round(x)

Table 5: Exceptions in IEEE 754 (α is the exponent-wrapping constant: 192 for single precision and 1536 for double)

The IEEE standard has a number of flags. A flag is a type of global variable raised as a side-effect of exceptional floating-point operations. There is one status flag for each of the five exceptions: underflow, overflow, division by zero, invalid operation, and inexact. When any exception occurs, the corresponding status flag is set. A flag can also be sensed, saved, restored, and lowered by a program. Implementations of the IEEE standard are required to provide users with a way to read and write the status flags. The flags are “sticky” in that once set, they remain set until explicitly cleared. Together with the four rounding modes, these flags can be put to good use.
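A small C99 sketch of sensing and lowering the sticky flags through the standard fenv.h interface:

    #include <stdio.h>
    #include <fenv.h>

    #pragma STDC FENV_ACCESS ON

    int main(void) {
        feclearexcept(FE_ALL_EXCEPT);    /* lower all five flags */

        volatile double x = 1e308;       /* volatile keeps the compiler   */
        volatile double y = x * 10.0;    /* from folding the product away */
        (void)y;                         /* the multiplication overflows  */

        if (fetestexcept(FE_OVERFLOW))   /* sense the sticky flag */
            printf("overflow flag is raised\n");
        if (fetestexcept(FE_INEXACT))    /* overflow also raises inexact */
            printf("inexact flag is raised\n");

        feclearexcept(FE_OVERFLOW | FE_INEXACT);  /* lower them again */
        return 0;
    }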

IEEE 754 specifies a default policy for each of the floating-point exceptions:

1. Signal the event by raising an appropriate one of the five flags, if it has not already been raised.

2. (Pre)substitute a default value for what would have been the result of the exceptional operation, see Table 5.

3. Resume execution of the program as if nothing exceptional had occurred.

The default policy is a reasonable thing to do. For example, if we compute the resistance of two parallel resistors R1 and R2, the total resistance is 1/(1/R1 + 1/R2). With the default values, when R1 or R2 or both equal zero, the result is zero. Without default values, we would have to check for these conditions explicitly, and conditional branches are inefficient on pipelined machines.
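A C sketch of the parallel-resistor computation; the function name parallel is just for illustration:

    #include <stdio.h>

    /* With the IEEE default values, R1 = 0 gives 1/R1 = inf, then
       inf + 1/R2 = inf, and 1/inf = 0: the physically right answer
       for a short circuit, with no conditional branches. */
    static double parallel(double r1, double r2) {
        return 1.0 / (1.0 / r1 + 1.0 / r2);
    }

    int main(void) {
        printf("%g\n", parallel(100.0, 100.0));  /* 50 */
        printf("%g\n", parallel(0.0, 100.0));    /* 0  */
        printf("%g\n", parallel(0.0, 0.0));      /* 0  */
        return 0;
    }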

With these default values, IEEE 754’s floating-point becomes an algebraically completed system: this means the computer’s every algebraic operation produces a well-defined result for all operands.

Why should computer arithmetic be algebraically completed?

Otherwise some exceptions would have to trap. Then robust programs could avert loss of control only by precluding those exceptions (at the cost of time wasted pretesting operands to detect rare hazards) or else by anticipating them all and providing handlers for their traps.

Example 10 A Tale of the Ariane 5 [7, p. 22]. A European satellite-lifting rocket turned cartwheels because its arithmetic system was not completed. Strong acceleration upon launch caused a conversion to integer to overflow in the software. Lacking a handler for this unanticipated overflow trap, the software trapped to the operating system. This was misinterpreted as necessitating strong corrective action.

The IEEE completion is not the only useful way to algebraically complete the real and complex number systems; there are other possible ways. By majority vote, a committee chose the particular completion specified by IEEE 754 because it was deemed less strange than others and more likely to render exceptions ignorable.

What's wrong with the default values specified for these exceptions by IEEE 754? For example, consider computing

    x / (x^2 + 1)    for x > √β × β^(emax/2).

Then x^2 overflows and the default result is x/∞ = 0, obviously wrong. Although in this case we can rewrite the formula as

    1 / (x + 1/x),

this is not always possible in general.
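The following C sketch (with an arbitrarily chosen x = 1e200) contrasts the naive formula with the rewritten one in IEEE double precision:

    #include <stdio.h>

    int main(void) {
        /* For large x, x*x overflows to inf, so the naive formula returns
           x/inf = 0, which is wrong. The rewritten form avoids squaring. */
        double x = 1e200;
        printf("naive:     %g\n", x / (x * x + 1.0));   /* 0 (wrong)      */
        printf("rewritten: %g\n", 1.0 / (x + 1.0 / x)); /* 1e-200 (right) */
        return 0;
    }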

A programmer can have good reasons to take exception to that completion and to every other, since they jeopardize cancellation laws or other relationships usually taken for granted. For example, x/x ≠ 1 if x is 0 or not finite; x − x ≠ 0 ≠ 0 · x if x is not finite. After non-finite values have been created, they may invalidate the logic underlying subsequent computation and then disappear: (finite/Overflow) becomes 0, (NaN < 7) becomes false, .... Perhaps no traces will be left to arouse suspicions that plausible final results are actually quite wrong. For example,

    do
        S
    until (x ≥ 100)

If x ever becomes NaN, the test x ≥ 100 evaluates to false and the NaN disappears without a trace. Another example of the use of trap handlers is backward compatibility. Old