# Numerical ComputationError

Error is the discrepancy between a quantity and the value used to represent it in the program. A result is accurate if its error is small. If is an approximation for , then

• the absolute error is , and
• the relative error is .

We are usually more interested in relative error, since the relevance of an error is usually in proportion to the quantity being represented. For example, misreporting the weight of an animal by one kilogram would be much more significant if the animal were a squirrel than if it were a blue whale.

Example
The expression sqrt(200.0), which returns the Float64-square-root of 200, yields

The actual decimal representation of is

The difference between these values, , is the absolute error, and is the relative error.

## Sources of Numerical Error

There are a few categories of numerical error.

Roundoff error comes from rounding numbers to fit them into a floating point representation.

Example
0.2 + 0.1is equal to in Float64 arithmetic. The discrepancy between 0.3 and this value is roundoff error.

Truncation error comes from using approximate mathematical formulas or algorithms.

Example
The Maclaurin series of is , so approximating as yields a truncation error equal to .

Example
Newton's method approximates a zero of a function by starting with a value near the desired zero and defining for all .

Under certain conditions, converges to a zero of as . The discrepancy between and is the truncation error associated with stopping Newton's method at the $n$th iteration.

Example
We may approximate using the sum . The error associated this approximation is a type of truncation error.

Statistical error arises from using randomness in an approximation.

Example
We can approximate the average height of a population of 100,000 people by selecting 100 people uniformly at random and averaging their measured heights. The error associated with this approximation is an example of statistical error.

Exercise
Discuss the error in each of the following scenarios using the terms roundoff error, truncation error, or statistical error.

• We use the trapezoid rule with 1000 trapezoids to approximate .
• We are trying to approximate for some function f that we can compute, and we attempt to do so by running (f(5 + 0.5^100) - f(5))/0.5^100. We fail to get a reasonable answer.
• To approximate the minimum of a function , we evaluate at 100 randomly selected points in and return the smallest value obtained.

Solution.

• The more trapezoids we use, the more accurate our answer will be. The difference between the exact answer and the value we get when we stop at 1000 trapezoids is truncation error.
• The real problem here is roundoff error. 5 + 0.5^100 gets rounded off to 5.0, so the numerator will always evaluate to 0. However, even if we used a BigFloat version of each of these values, there would still be truncation error in this approximation, since the expression we used was obtained by cutting off the limit in the definition of the derivative at a small but positive increment size.
• This is an example of statistical error, since the output of the algorithm depends on the randomness we use to select the points.

## Condition Number

The derivative of a single-variable function may be thought of as a measure of how the function stretches or compresses absolute error:

The condition number of a function measures how it stretches or compresses relative error. Just as the derivative helps us understand how small changes in input transform to small changes in output, the condition number tells us how a small relative error in the initial data of a problem affects the relative error of the solution. We will use the variable to denote a problem's initial data and to denote the solution of the problem with initial data .

The condition number of a function is defined to be the absolute value of the ratio of the relative change in output of the function to a very small relative change in the input. The condition number of a problem is the condition number of the function which maps the problem's initial data to its solution.

Definition
If is the map from the initial data of a problem to its solution , then the condition number of the problem is

Example
Show that the condition number of is constant, for any .

Solution. We have

for all .

Example
Show that the condition number of the function is very large for values of near 1.

Solution. We substitute into the formula for condition number and get

for values of near . This expression goes to infinity as , so the condition number is very large.

Subtracting 1 from two numbers near 1 preserves their difference, but the size of this difference is increased because the numbers themselves are much smaller.

Example
If , then the solution of the equation is . If we change the initial data to , then the solution changes to , which represents a relative change of

in the solution. The relative change in input is , so taking the absolute value of the ratio of to and sending , we see that condition number of this problem is .

Exercise
Consider a function . If the input changes from to for some small value , then the output changes to approximately . Calculate the ratio of the relative change in the output to the relative change in the input, and show that you get

Solution. The relative change in output is

and the relative change in input is . Dividing these two quantities gives

as desired.

More generally, if the initial data is in and the solution is in , then the condition number is defined to be

where denotes the operator norm. The operator norm of the derivative is the appropriate generalization of the norm of the derivative of since it measures the maximum stretching factor of near .

### Well conditioned and ill-conditioned problems

If the condition number of a problem is very large, then small errors in the problem data lead to large changes in the result. A problem with large condition number is said to be ill-conditioned. Unless the initial data can be specified with correspondingly high precision, it will not be possible to solve the problem meaningfully.

Example Consider the following matrix equation for and .

Find the values of for which solving this equation for is ill-conditioned.

Solution. If , then the solution of this equation is

Using the formula for above, we can work out (after several steps) that

If is very close to , then is very large, and the matrix is ill-conditioned:

[2.01 3; 6 9] \ [4; 5]
[2.02 3; 6 9] \ [4; 5]

### Machine epsilon

Machine epsilon, denoted , is the maximum relative error associated with rounding a real number to the nearest value representable as a given floating point type. For Float64, this value is .

A competing convention—more widely used outside academia—defines to be the difference between 1 and the next representable number, which for Float64is . This is the value returned by eps() in Julia. Since we typically introduce a relative error on the order of to encode the initial data of a problem, the relative error of the computed solution should be expected to be no smaller than , regardless of the algorithm used.

### Algorithm stability

An algorithm used to solve a problem is stable if it is approximately as accurate as the condition number of the problem allows. In other words, an algorithm is unstable if the answers it produces have relative error many times larger than .

Example
Consider the problem of evaluating near for values of near 0. Show that the problem is well-conditioned, but algorithm of evaluating the expression following the order of operations is unstable.

Comment on whether there are stable algorithms for evaluating near .

Solution. Substituting this function into the condition number formula, we find that

Therefore, , which means that this problem is well-conditioned at 0. However, the algorithm of substituting directly includes an ill-conditioned step: subtracting 1.

What's happening is that a roundoff error of approximately is introduced when is rounded to the nearest Float64. When 1 is subtracted, we still have an error of around . Since , we will have , and that means that the relative error in the value we find for will be approximately . If is small, this will be many times larger than .

There are stable algorithms for approximating near . For example, we could use the Taylor series

and approximate as a sum of the first several terms on the right-hand side. Since power functions are well-conditioned (and performing the subtractions is also well-conditioned as long as is small enough that each term is much smaller than the preceding one), this algorithm is stable. Alternatively, we can use the identity

which can be obtained by multiplying by and simplifying the numerator. Substituting into this expression is stable, because adding 1, square rooting, and reciprocating are well-conditioned.

### Matrix condition number

The condition number of an matrix is defined to be the maximum condition number of the function as ranges over . The condition number of can be computed using its singular value decomposition:

Exercise
Show that the condition number of a matrix is equal to the ratio of its largest and smallest singular values.

Interpret your results by explaining how to choose two vectors with small relative difference which are mapped to two vectors with large relative difference by , assuming that has a singular value which is many times larger than another. Use the figure below to help with the intuition.

Solution. The derivative of the transformation is the matrix itself, and the operator norm of is equal to its largest singular value. Therefore, to maximize , we minimize the ratio . This ratio is minimized when is the right singular vector with the least singular value. Therefore, the maximum possible value of is the ratio of the largest singular value of to the smallest singular value of .

Exercise
Find the condition number of the function , where A = [1 2; 3 4] and show that there is a vector and an error for which the relative error is indeed magnified by approximately the condition number of .

using LinearAlgebra
A = [1 2; 3 4]

Solution. We choose v and e to be the columns of in the singular value decomposition of A:

using LinearAlgebra
A = [1 2; 3 4]
U, S, V = svd(A)
σmax, σmin = S
κ = σmax/σmin
v = V[:,2]
e = V[:,1]
rel_error_output = norm(A*(v+e) - A*v)/norm(A*v)
rel_error_input = norm(v + e - v) / norm(v)
rel_error_output / rel_error_input, κ

## Hazards

Integer or floating point arithmetic can overflow, and may do so without warning.

2^63
10.0^309

Example
In September 2013, NASA lost touch with the Deep Impact space probe because systems on board tracked time as a 32-bit-signed-integer number of tenth-second increments from January 1, 2000. The number of such increments reached the maximum size of a 32-bit signed integer in August of 2013.

Errors resulting from performing ill-conditioned subtractions are called catastrophic cancellation.

Example
Approximating with the result of sqrt(10^6 + 1) - sqrt(10^6), we get a relative error of approximately , while using 1/(sqrt(10^6 + 1) + sqrt(10^6)) gives a relative error of (more than a thousand times smaller).

Exercise
Use your knowledge of floating point arithmetic to explain why computing directly is much less precise than computing .

Solution. The gaps between successive representable positive values get wider as we move on the right on the number line. Therefore, the error of the first calculation is the roundoff error associated with calculating sqrt(10^6+1), which is roughly .

The relative error in 1/(sqrt(10^6 + 1) + sqrt(10^6)), meanwhile, is approximately the same as the relative error in the calculation of sqrt(10^6 + 1) + sqrt(10^6) (since the condition number of the reciprocal function is approximately 1). This relative error is only about .

If you rely on exact comparisons for floating point numbers, be alert to the differences between Float64 arithmetic and real number arithmetic:

function increment(n)
a = 1.0
for i = 1:n
a = a + 0.01
end
a
end
increment(100) > 2
(increment(100) - 2) / eps(2.0)

Each time we add , we have to round off the result to represent it as a Float64. These roundoff errors accumulate and lead to a result which is two ticks to the right of 2.0.

It is often more appropriate to compare real numbers using ≈( \approx«tab»), which checks that two numbers and differ by at most .

Exercise
Guess what value the following code block returns. Run it and see what happens. Discuss why your initial guess was correct or incorrect, and suggest a value near 0.1 that you could use in place of 0.1 to get the expected behavior.

function increment_till(t, step=0.1)
x = 0.5
while x < t
x += step
end
x
end
increment_till(1.0)

Solution. It's reasonable to guess that the returned value will be 1.0. However, it's actually approximately 1.1. The reason is that adding the Float64 representation of 0.1 ten times starting from 0.0 results in a number slightly smaller than 1.0. It turns out that 0.6 (the real number) is 20% of the way from the Float64 tick before it to the Float64 tick after it:

(1//10 - floor(2^53 * 1//10) // 2^53) * 2^53

This means that the Float64 sum is less than the mathematical sum after adding 0.1 once, then less after adding 0.1 again, and so on. By the time we we get to , we've lost a full tick spacing so after 5 iterations, is equal to .

We could use 1/8 = 0.125 instead of 0.1 to get the expected behavior, since small inverse powers of 2 and their sums with small integers can be represented exactly as 64-bit floating points numbers.

Bruno