1.2 Round-off Errors and Computer Arithmetic 19

The exponential part of the number is, therefore, 2^(1027−1023) = 2^4. The final 52 bits specify that the mantissa is

    f = 1·(1/2)^1 + 1·(1/2)^3 + 1·(1/2)^4 + 1·(1/2)^5 + 1·(1/2)^8 + 1·(1/2)^12.

As a consequence, this machine number precisely represents the decimal number

    (−1)^s 2^(c−1023)(1 + f) = (−1)^0 · 2^(1027−1023) (1 + 1/2 + 1/8 + 1/16 + 1/32 + 1/256 + 1/4096) = 27.56640625.

However, the next smallest machine number is

    0 10000000011 1011100100001111111111111111111111111111111111111111,

and the next largest machine number is

    0 10000000011 1011100100010000000000000000000000000000000000000001.

This means that our original machine number represents not only 27.56640625, but also half of the real numbers that are between 27.56640625 and the next smallest machine number, as well as half the numbers between 27.56640625 and the next largest machine number. To be precise, it represents any real number in the interval

    [27.5664062499999982236431605997495353221893310546875,
     27.5664062500000017763568394002504646778106689453125).

The smallest normalized positive number that can be represented has s = 0, c = 1, and f = 0 and is equivalent to

    2^−1022 · (1 + 0) ≈ 0.22251 × 10^−307,

and the largest has s = 0, c = 2046, and f = 1 − 2^−52 and is equivalent to

    2^1023 · (2 − 2^−52) ≈ 0.17977 × 10^309.

Numbers occurring in calculations that have a magnitude less than 2^−1022 · (1 + 0) result in underflow and are generally set to zero. Numbers greater than 2^1023 · (2 − 2^−52) result in overflow and typically cause the computations to stop (unless the program has been designed to detect this occurrence). Note that there are two representations for the number zero: a positive 0 when s = 0, c = 0, and f = 0, and a negative 0 when s = 1, c = 0, and f = 0.

Copyright 2010 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part.
Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
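The 64-bit pattern worked through above can be checked directly in Python (our language choice; the text itself uses Maple later in this section). This sketch decodes the bit string with the standard struct module and compares the normalized extremes against sys.float_info:

```python
import struct
import sys

# The machine number from the text: sign bit 0, characteristic 10000000011,
# and mantissa 101110010001 followed by 40 zeros (52 bits in all).
bits = "0" + "10000000011" + "101110010001" + "0" * 40
value = struct.unpack(">d", int(bits, 2).to_bytes(8, "big"))[0]
print(value)  # 27.56640625

# The normalized extremes quoted in the text.
print(sys.float_info.min)  # 2^-1022              ≈ 0.22251 × 10^-307
print(sys.float_info.max)  # 2^1023 · (2 - 2^-52) ≈ 0.17977 × 10^309
```

Since 27.56640625 is itself a machine number, the decoded value is exact, not an approximation.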
20 CHAPTER 1 Mathematical Preliminaries and Error Analysis

Decimal Machine Numbers

The use of binary digits tends to conceal the computational difficulties that occur when a finite collection of machine numbers is used to represent all the real numbers. To examine these problems, we will use more familiar decimal numbers instead of binary representation. Specifically, we assume that machine numbers are represented in the normalized decimal floating-point form

    ±0.d1d2…dk × 10^n,  1 ≤ d1 ≤ 9, and 0 ≤ di ≤ 9, for each i = 2, …, k.

Numbers of this form are called k-digit decimal machine numbers.

Any positive real number within the numerical range of the machine can be normalized to the form

    y = 0.d1d2…dk dk+1 dk+2 … × 10^n.

The floating-point form of y, denoted fl(y), is obtained by terminating the mantissa of y at k decimal digits. (The error that results from replacing a number with its floating-point form is called round-off error, regardless of whether the rounding or chopping method is used.) There are two common ways of performing this termination. One method, called chopping, is to simply chop off the digits dk+1 dk+2 …. This produces the floating-point form

    fl(y) = 0.d1d2…dk × 10^n.

The other method, called rounding, adds 5 × 10^(n−(k+1)) to y and then chops the result to obtain a number of the form

    fl(y) = 0.δ1δ2…δk × 10^n.

For rounding, when dk+1 ≥ 5, we add 1 to dk to obtain fl(y); that is, we round up. When dk+1 < 5, we simply chop off all but the first k digits; so we round down. If we round down, then δi = di, for each i = 1, 2, …, k. However, if we round up, the digits (and even the exponent) might change.

Example 1  Determine the five-digit (a) chopping and (b) rounding values of the irrational number π.

Solution  The number π has an infinite decimal expansion of the form π = 3.14159265…. Written in normalized decimal form, we have

    π = 0.314159265… × 10^1.
(a) The floating-point form of π using five-digit chopping is

    fl(π) = 0.31415 × 10^1 = 3.1415.

(b) The sixth digit of the decimal expansion of π is a 9, so the floating-point form of π using five-digit rounding is

    fl(π) = (0.31415 + 0.00001) × 10^1 = 3.1416.

The following definition describes two methods for measuring approximation errors. (The relative error is generally a better measure of accuracy than the absolute error because it takes into consideration the size of the number being approximated.)

Definition 1.15  Suppose that p∗ is an approximation to p. The absolute error is |p − p∗|, and the relative error is |p − p∗| / |p|, provided that p ≠ 0.

Consider the absolute and relative errors in representing p by p∗ in the following example.
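Example 1 can be reproduced with Python's decimal module (an illustrative sketch; the language choice is ours, not the text's). A decimal Context whose prec counts significant digits gives k-digit chopping with ROUND_DOWN and k-digit rounding with ROUND_HALF_UP:

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_UP
import math

pi = Decimal(math.pi)  # far more digits than a five-digit experiment needs

chop5 = Context(prec=5, rounding=ROUND_DOWN)      # five-digit chopping
round5 = Context(prec=5, rounding=ROUND_HALF_UP)  # five-digit rounding

print(chop5.plus(pi))   # 3.1415
print(round5.plus(pi))  # 3.1416
```

Both results agree with parts (a) and (b) of the solution above.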
Example 2  Determine the absolute and relative errors when approximating p by p∗ when

(a) p = 0.3000 × 10^1 and p∗ = 0.3100 × 10^1;
(b) p = 0.3000 × 10^−3 and p∗ = 0.3100 × 10^−3;
(c) p = 0.3000 × 10^4 and p∗ = 0.3100 × 10^4.

Solution
(a) For p = 0.3000 × 10^1 and p∗ = 0.3100 × 10^1 the absolute error is 0.1, and the relative error is 0.3333 × 10^−1.
(b) For p = 0.3000 × 10^−3 and p∗ = 0.3100 × 10^−3 the absolute error is 0.1 × 10^−4, and the relative error is 0.3333 × 10^−1.
(c) For p = 0.3000 × 10^4 and p∗ = 0.3100 × 10^4, the absolute error is 0.1 × 10^3, and the relative error is again 0.3333 × 10^−1.

This example shows that the same relative error, 0.3333 × 10^−1, occurs for widely varying absolute errors. As a measure of accuracy, the absolute error can be misleading and the relative error more meaningful, because the relative error takes into consideration the size of the value. We often cannot find an accurate value for the true error in an approximation. Instead we find a bound for the error, which gives us a "worst-case" error.

The following definition uses relative error to give a measure of significant digits of accuracy for an approximation. (The term significant digits is often used to loosely describe the number of decimal digits that appear to be accurate. The definition is more precise, and provides a continuous concept.)

Definition 1.16  The number p∗ is said to approximate p to t significant digits (or figures) if t is the largest nonnegative integer for which

    |p − p∗| / |p| ≤ 5 × 10^−t.

Table 1.1 illustrates the continuous nature of significant digits by listing, for the various values of p, the least upper bound of |p − p∗|, denoted max |p − p∗|, when p∗ agrees with p to four significant digits.

Table 1.1
    p             0.1      0.5      100   1000  5000  9990   10000
    max |p − p∗|  0.00005  0.00025  0.05  0.5   2.5   4.995  5.
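Definition 1.16 makes the entries of Table 1.1 easy to regenerate: for p∗ agreeing with p to t significant digits, the least upper bound of |p − p∗| is 5 × 10^−t · |p|. A short sketch (the function name is ours):

```python
def max_abs_error(p, t=4):
    # Least upper bound of |p - p*| when p* agrees with p to t significant
    # digits (Definition 1.16): |p - p*| / |p| <= 5 × 10^(-t).
    return 5 * 10 ** (-t) * abs(p)

# Regenerate the Table 1.1 row of least upper bounds.
for p in [0.1, 0.5, 100, 1000, 5000, 9990, 10000]:
    print(p, max_abs_error(p))
```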
Returning to the machine representation of numbers, we see that the floating-point representation fl(y) for the number y has the relative error

    |(y − fl(y)) / y|.

If k decimal digits and chopping are used for the machine representation of

    y = 0.d1d2…dk dk+1 … × 10^n,
then

    |(y − fl(y)) / y| = |(0.d1d2…dk dk+1… × 10^n − 0.d1d2…dk × 10^n) / (0.d1d2… × 10^n)|
                      = |(0.dk+1dk+2… × 10^(n−k)) / (0.d1d2… × 10^n)|
                      = (0.dk+1dk+2… / 0.d1d2…) × 10^−k.

Since d1 ≠ 0, the minimal value of the denominator is 0.1. The numerator is bounded above by 1. As a consequence,

    |(y − fl(y)) / y| ≤ (1 / 0.1) × 10^−k = 10^−k+1.

In a similar manner, a bound for the relative error when using k-digit rounding arithmetic is 0.5 × 10^−k+1. (See Exercise 24.)

Note that the bounds for the relative error using k-digit arithmetic are independent of the number being represented. This result is due to the manner in which the machine numbers are distributed along the real line. Because of the exponential form of the characteristic, the same number of decimal machine numbers is used to represent each of the intervals [0.1, 1], [1, 10], and [10, 100]. In fact, within the limits of the machine, the number of decimal machine numbers in [10^n, 10^(n+1)] is constant for all integers n.

Finite-Digit Arithmetic

In addition to inaccurate representation of numbers, the arithmetic performed in a computer is not exact. The arithmetic involves manipulating binary digits by various shifting, or logical, operations. Since the actual mechanics of these operations are not pertinent to this presentation, we shall devise our own approximation to computer arithmetic. Although our arithmetic will not give the exact picture, it suffices to explain the problems that occur. (For an explanation of the manipulations actually involved, the reader is urged to consult more technically oriented computer science texts, such as [Ma], Computer System Architecture.)

Assume that the floating-point representations fl(x) and fl(y) are given for the real numbers x and y and that the symbols ⊕, ⊖, ⊗, ⊘ represent machine addition, subtraction, multiplication, and division operations, respectively.
We will assume a finite-digit arithmetic given by

    x ⊕ y = fl(fl(x) + fl(y)),    x ⊗ y = fl(fl(x) × fl(y)),
    x ⊖ y = fl(fl(x) − fl(y)),    x ⊘ y = fl(fl(x) ÷ fl(y)).

This arithmetic corresponds to performing exact arithmetic on the floating-point representations of x and y and then converting the exact result to its finite-digit floating-point representation.

Rounding arithmetic is easily implemented in Maple. For example, the command

    Digits := 5

causes all arithmetic to be rounded to 5 digits. To ensure that Maple uses approximate rather than exact arithmetic, we use the evalf command. For example, if x = π and y = √2 then

    evalf(x); evalf(y)

produces 3.1416 and 1.4142, respectively. Then fl(fl(x) + fl(y)) is performed using 5-digit rounding arithmetic with

    evalf(evalf(x) + evalf(y))
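The same five-digit rounding arithmetic can be sketched in Python's decimal module instead of Maple (the translation is ours): a Context with prec=5 rounds every operation to five significant digits, so ctx.add(fl(x), fl(y)) plays the role of fl(fl(x) + fl(y)).

```python
from decimal import Decimal, Context, ROUND_HALF_UP
import math

ctx = Context(prec=5, rounding=ROUND_HALF_UP)  # five-digit rounding

def fl(value):
    # Round a number to five significant decimal digits.
    return ctx.plus(Decimal(value))

x, y = fl(math.pi), fl(math.sqrt(2))
print(x, y)        # 3.1416 1.4142
s = ctx.add(x, y)  # x ⊕ y = fl(fl(x) + fl(y))
print(s)           # 4.5558
```

The sum matches the Maple result quoted next in the text.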
which gives 4.5558. Implementing finite-digit chopping arithmetic is more difficult and requires a sequence of steps or a procedure. Exercise 27 explores this problem.

Example 3  Suppose that x = 5/7 and y = 1/3. Use five-digit chopping for calculating x + y, x − y, x × y, and x ÷ y.

Solution  Note that x = 5/7 = 0.714285… and y = 1/3 = 0.33333… implies that the five-digit chopping values of x and y are

    fl(x) = 0.71428 × 10^0 and fl(y) = 0.33333 × 10^0.

Thus

    x ⊕ y = fl(fl(x) + fl(y)) = fl(0.71428 × 10^0 + 0.33333 × 10^0)
          = fl(1.04761 × 10^0) = 0.10476 × 10^1.

The true value is x + y = 5/7 + 1/3 = 22/21, so we have

    Absolute Error = |22/21 − 0.10476 × 10^1| = 0.190 × 10^−4

and

    Relative Error = (0.190 × 10^−4) / (22/21) = 0.182 × 10^−4.

Table 1.2 lists the values of this and the other calculations.

Table 1.2
    Operation   Result           Actual value   Absolute error   Relative error
    x ⊕ y       0.10476 × 10^1   22/21          0.190 × 10^−4    0.182 × 10^−4
    x ⊖ y       0.38095 × 10^0   8/21           0.238 × 10^−5    0.625 × 10^−5
    x ⊗ y       0.23809 × 10^0   5/21           0.524 × 10^−5    0.220 × 10^−4
    x ⊘ y       0.21428 × 10^1   15/7           0.571 × 10^−4    0.267 × 10^−4

The maximum relative error for the operations in Example 3 is 0.267 × 10^−4, so the arithmetic produces satisfactory five-digit results. This is not the case in the following example.

Example 4  Suppose that in addition to x = 5/7 and y = 1/3 we have

    u = 0.714251,  v = 98765.9,  and  w = 0.111111 × 10^−4,

so that

    fl(u) = 0.71425 × 10^0,  fl(v) = 0.98765 × 10^5,  and  fl(w) = 0.11111 × 10^−4.

Determine the five-digit chopping values of x ⊖ u, (x ⊖ u) ⊘ w, (x ⊖ u) ⊗ v, and u ⊕ v.
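The Result column of Table 1.2 can be reproduced with the same decimal-module idea, switched to ROUND_DOWN so that every operation chops to five significant digits (again a sketch in Python rather than the text's Maple, where chopping is harder to express):

```python
from decimal import Decimal, Context, ROUND_DOWN

ctx = Context(prec=5, rounding=ROUND_DOWN)  # five-digit chopping

def fl(value):
    # Chop a number to five significant decimal digits.
    return ctx.plus(Decimal(value))

x = fl(Decimal(5) / Decimal(7))  # 0.71428
y = fl(Decimal(1) / Decimal(3))  # 0.33333

print(ctx.add(x, y))       # 1.0476  (x ⊕ y = 0.10476 × 10^1)
print(ctx.subtract(x, y))  # 0.38095 (x ⊖ y)
print(ctx.multiply(x, y))  # 0.23809 (x ⊗ y)
print(ctx.divide(x, y))    # 2.1428  (x ⊘ y = 0.21428 × 10^1)
```

Each exact operation on the chopped operands is itself chopped by the context, which is precisely the x ⊕ y = fl(fl(x) + fl(y)) model defined above.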