# Numerical ComputationMachine Arithmetic

Computers store all information, including numerical values, as sequences of **type** of a numeric value specifies how the underlying sequence of bits should be interpreted as a number.

You can access the bit representation of a numerical value in Julia using the function `bitstring`

. For example, we can inspect how boolean values are represented in Julia:

println(bitstring(true)) println(bitstring(false))

You might think that boolean values would be stored using a single bit. However, as you can see from the output above, in fact they use

In this section, we will introduce several of the most important numeric types.

**Exercise**

Humans typically interpret a string of digits as an integer using *place value*: the units digit is worth , the next digit to the left is worth , and so on. Then , for example. This is called the **decimal representation** of a number.

We can do the same thing with 2 in place of 10: the rightmost digit is worth , the next digit is worth , and so on. Instead of 10 digits we only have two *bits*: 0 and 1. This is called the **binary representation** of a number. The binary representation of 13, for example, is , since .

Find the binary representations of each of the following numbers: 2, 16, 20, and 100.

*Solution.* The binary representation of 2 is , of 16 is , of 20 is , and of 100 is .

## 64-bit Integers

The most common numeric types used in modern computers use

Therefore, with 64 bits we can represent numbers. For example, we could represent the integers from 0 to by interpreting each string of 0's and 1's as a binary number. This type exists (in Julia it's called `UInt64`

, for *unsigned 64-bit integer*), but most of the time it's important to be able to represent negative integers as well.

In order to make room for negatives, we will only use

More precisely, for , we represent using its binary representation, with leading zeros as necessary to get 64 total bits. For , we represent using the binary representation of . One good reason to use this convention rather than dedicating one bit to the number's sign and using the other 63 bits to represent the number's absolute value in binary is that

**Example**

The expression `bitstring(+34)`

evaluates to

`0000000000000000000000000000000000000000000000000000000000100010`

;This checks out: .

The expression `bitstring(-34)`

evaluates to

`1111111111111111111111111111111111111111111111111111111111011110`

.We could check that this is the binary representation of , but in the following exercise we will learn a trick for doing that without having to deal with all those 1's.

**Exercise**

Show that if , then you can find `bitstring(-n)`

from `bitstring(n)`

by (i) flipping every bit, and (ii) adding 1 to the resulting number (interpreted as an integer represented in binary).

*Solution.* Recall that for $-n$ is represented by the binary representation of Representing in binary is not straightforward but is easier because in binary is just ones. This means that for any we can get in binary by changing a to a in the binary representation of at every position where in binary has a . In other words, to get in binary, we just flip the binary representation of Since

we need to add after flipping the bits.

## 64-bit Floating Point Numbers

Integer types are appropriate for calculations which only involve integer operations (multiplication, addition, and negation), but the only integers with integer *reciprocals* are

Let's visualize our number system by placing a tick mark on the number line for each number we're representing. One simple idea is to choose some small value and represent integer multiples of :

These are called **fixed point** numbers, and they have some uses (for example, in financial applications where all monetary values can safely be assumed to be a multiple of one cent).

The problem with fixed point numbers for most scientific computing applications is that when we get very small numbers (for example, when taking the reciprocal of very large numbers), they have to be rounded off a lot relative to the size of the number. For example, if , then the reciprocal of 1999 would have to be rounded from approximately to . If that value is then multiplied by, say, 8000, then the round-off would result in a product of 8 instead of approximately 4.

We could address that problem by making smaller, but we encounter a tradeoff between representing small numbers accurately and being able to represent large numbers at all. Furthermore, if is very small, then large numbers are being represented with unnecessarily high precision (relative to the size of the number). The way out of this tradeoff is to relax the fixed-width increment and using a variable-sized gap between representable numbers. Such number systems are called **floating point** systems.

One way to represent numbers more densely near zero is to put equally spaced tick marks between 1 and 2, and then scale that interval up repeatedly into , then , then , and so on, and also scale it down to , , and so on. Here's an example of such a scheme: we place 8 tick marks between 1 and 2, and then we scale that interval's worth of tick marks four times by a factor of 2, and also 3 times by a factor of .

We can see on the left edge of the (top) picture that we didn't cover zero! There is also quite a large gap between 0 and the smallest representable number in this system. So we replace the leftmost interval between successive powers of 2 with ticks twice as far apart so that they reach down to 0. The locations of these new ticks are called *subnormal numbers*.

If we have 64 bits, we can do the same thing but on a grander scale. Rather than subdividing the interval into equal-length intervals, we use intervals. Rather than scaling than interval up and down just a few times in each direction, we scale up 1023 times—covering every binary interval up to —and down 1022 times—covering every binary interval down to . Finally, the subnormal numbers will be equally spaced between 0 and .

We can accomplish all of this if we dedicate 11 bits to indicating the index of the binary interval we're in—starting from for numbers in the subnormal range and going up to for the last interval —and 52 bits for indicating the index of the tick within that interval. For example, the number would correspond to

The nonnegative representable numbers are laid out as shown (figure not drawn to scale!):

That leaves us with one bit, which we call , to indicate the sign of the value we're representing. Also, note that we're leaving out one possible value (the last one, ); more on that later.

We can use this description to write down formulas for the real number values represented by each string of 64 bits. We define to be the first bit of the string, to be the next 11 bits interpreted as a binary integer, and to be the remaining 52 bits interpreted as a binary integer. If , then the string represents the number

Note that the exponent ranges from up to as ranges from 1 to . If , then the string represents

These are the

Another way to visualize the floating point number system is to graph the floating point value associated with each binary string against its value as an unsigned integer. Here's an example for an 8-bit floating point scheme, with one sign bit, three bits for the exponent, and four bits for the mantissa. Subnormal numbers are shown in red, and powers of 2 are shown in gold.

We appropriate the *last* value of for a special meaning: if , then the string represents one of the special values `Inf`

or `NaN`

, depending on the value of .

**Example**`bitstring(-0.75)`

returns

`1011111111101000000000000000000000000000000000000000000000000000`

.The leading `1`

indicates that the number is negative, the next eleven digits `01111111110`

give , and interpreting the remaining 52 digits as a base-2 fraction gives . So the value represented is

Thus can be represented exactly as a `Float64`

.

**Exercise**

Show that cannot be represented exactly as a `Float64`

.

*Solution.* By construction, every representable Float64 is a rational number whose denominator is a power of 2. Therefore, when a representable Float64 is written as a fraction and simplified, its denominator is a power of 2. Since does not fit this description, it is not representable as a `Float64`

.

**Exercise**

The Julia function `nextfloat`

returns the smallest representable value which is larger than its argument. What value will be returned by the code below?

log2(nextfloat(11.0) - 11.0)

*Solution.* The difference between 11 and the next Float64 is , since 11 is in the binary interval , which is 3 binary intervals to the right of . So the value returned will be .

## 32-bit Floating Point Numbers

Each 32-bit string represents the value

where is the first bit of the string, is the next 8 bits interpreted as a binary integer, and is the remaining 23 bits are interpreted as a binary integer. In other words, there are

**Example**

`bitstring(Float32(-0.75))`

returns `10111111010000000000000000000000`

**Exercise**

Find the positive difference between 1 and the first number greater than 1 which is representable as a `Float32`

.

*Solution.* We can represent 1 by choosing and . To represent the largest possible number less than 1, we let the 23-bit string representing be 22 zeros followed by a 1. So the answer is .

We can check this in Julia using `nextfloat(Float32(1.0)) - 1`

, and indeed it returns a value which is equal to : `Float32(2)^(-23) == nextfloat(Float32(1.0)) - 1`

returns `true`

.

## Arbitrary-Precision Numbers

Sometimes you might want store a number without being constrained to 64 bits, or even 128 or 256 bits. It is possible to define a type which uses an *extensible* number of bits (depending on the size of the integer or precision of the real number being stored). These types are called **bignums**, and in Julia they're called `BigInt`

and `BigFloat`

.

**Exercise**

(i) Arbitrary-precision arithmetic is helpful for inspecting the behavior of lower precision formats. Find the exact value of the difference between 1.1 and the `Float64`

value nearest 1.1. (Hint: `big(1.1)`

interprets 1.1 as a Float64 value—look at that value and mentally subtract 1.1).

(ii) Confirm that calculating the decimal representation of is no problem with big number arithmetic. Convert the resulting sequence of decimal digits to a string and find its length.

*Solution.* `big(1.1)`

returns a number which is more than 1.1.

`length(string(big(2)^100000))`

returns , so that's how many digits has.

To obtain numerical values of other types in Julia, use `parse`

or `big`

.

parse(BigInt,"1267650600228229401496703205376")^(big(1)/100)

`Float64`

and `Int64`

operations are performed *in hardware*, meaning that they use instructions programmed directly into the computer's microprocessor. They are much faster and more memory efficient than arbitrary precision arithmetic, which has to be done

**Exercise** Run the cell below to figure out about how many times slower BigInt addition is relative to Int64 addition.

a = big.(collect(1:100_000)) b = collect(1:100_000) s = @elapsed(sum(a)) t = @elapsed(sum(b)) s/t

*Solution.* Outcomes can vary quite a bit, but you should find that the bignum operation typically takes 10-20 times as long.

## General Comments

Choice of numerical representation depends on the application at hand, since each has its advantages and disadvantages. `Int64`

arithmetic is actually *modular* arithmetic with a modulus of . This means that Int64 arithmetic is exact unless our calculation takes us outside the window of representable values. Basic `Float64`

operations return the same number as computing the mathematical result of the operation and rounding to the nearest `Float64`

-representable value (or one of the two nearest ones if it's halfway between).

**Exercise**

Without using a computer, perform the operations in the expression using `Float64`

arithmetic. Repeat with the expression . Then check your findings by evaluating these expressions in Julia.

*Solution.* The first expression evaluates to 1.0, since adding only gets you 40% of the way to the next representable value. The second expression evaluates to . And indeed, `(1.0 + 0.4/2^52) + 0.4/2^52 == 1.0 + (0.4/2^52 + 0.4/2^52)`

returns `false`

.

A *numeric literal* is a sequence of characters to be parsed and interpreted as a numerical value.

- Numeric literals with a decimal point are
*real literals*. - Numeric literals without a decimal point are
*integer literals*.

**Example**

In the expression `2.718^50+1`

, `2.718`

is a real literal, and `50`

and `1`

are both integer literals.

Integer literals are interpreted as

**Example**

`2^100`

returns `0`

, since `2`

and `100`

are interpreted as `Int64`

values, and is equivalent to 0 modulo .

**Exercise**

Explain why it is never necessary to use a `BigInt`

for a loop counter (that is, a variable which starts at 0 or 1 and is incremented by 1 each time the body of the loop runs).

*Solution.* An `Int64`

can store values as large as . Assuming optimistically that the body of the loop can execute in a nanosecond, the loop could run for nearly 300 years before exhausting the positive `Int64`

values. Therefore, we do not need `BigInt`

values for a loop counter.