Next: Bibliography Up: III. SELECTED APPLICATIONS Previous: 8.4 Direct Simulation Monte


Appendix

Machine Errors

Typically, in a computer real numbers are stored as follows:
\begin{displaymath}
\begin{tabular}[b]{|l|l|l|}
\hline
$\pm$ (sign; $1$ bit) & $e$ (exponent; $8$ bits) & $m$ (mantissa; $23$ bits) \\
\hline
\end{tabular}
\end{displaymath}

or, in a more usual notation,

\begin{displaymath}
x = \pm\, m \cdot 2^{\,e-e_{0}}
\end{displaymath}



EXAMPLE: With a bias of $e_{0}=151$ (and keeping the high-end bit of the mantissa) the internal representation of the number $0.25$ is, using $1/4=(1\cdot2^{22})\cdot2^{-24}$ and $-24+151=127$,

\begin{displaymath}
\frac{1}{4} = \quad
\begin{tabular}[b]{|l|l|l|}
\hline
$+$ & $127$ & $1\,0\,0\,\dots\,0\,0$ \\
\hline
\end{tabular}
\end{displaymath}
Before any addition or subtraction the exponents of the two arguments must be equalized; to this end the smaller exponent is increased, and the respective mantissa is right-shifted (decreased). All bits of the mantissa that are thus ``expelled'' at the right end are lost to the accuracy of the result. The resulting error is called roundoff error. By machine accuracy we denote the smallest number that, when added to $1.0$, produces a result $\neq 1.0$. In the above representation the number $2^{-22} \approx 2.38 \cdot 10^{-7}$, when added to $1.0$, just produces a result $\neq 1.0$, while the next smaller representable number, $2^{-23} \approx 1.19 \cdot 10^{-7}$, would leave not a rack behind:

\begin{displaymath}
1.0 \quad
\begin{tabular}[b]{|l|l|l|}
\hline
$+$ & $129$ & $1\,0\,0\,\dots\,0\,0$ \\
\hline
\end{tabular}
\end{displaymath}
\begin{displaymath}
+\;2^{-22} \quad
\begin{tabular}[b]{|l|l|l|}
\hline
$+$ & $107$ & $1\,0\,0\,\dots\,0\,0$ \\
\hline
\end{tabular}
\end{displaymath}
\begin{displaymath}
= \quad
\begin{tabular}[b]{|l|l|l|}
\hline
$+$ & $129$ & $1\,0\,0\,\dots\,0\,1$ \\
\hline
\end{tabular}
\end{displaymath}
but

\begin{displaymath}
1.0 \quad
\begin{tabular}[b]{|l|l|l|}
\hline
$+$ & $129$ & $1\,0\,0\,\dots\,0\,0$ \\
\hline
\end{tabular}
\end{displaymath}
\begin{displaymath}
+\;2^{-23} \quad
\begin{tabular}[b]{|l|l|l|}
\hline
$+$ & $106$ & $1\,0\,0\,\dots\,0\,0$ \\
\hline
\end{tabular}
\end{displaymath}
\begin{displaymath}
= \quad
\begin{tabular}[b]{|l|l|l|}
\hline
$+$ & $129$ & $1\,0\,0\,\dots\,0\,0$ \\
\hline
\end{tabular}
\end{displaymath}
A particularly dangerous situation arises when two almost equal numbers have to be subtracted. Such a case is depicted in Figure 8.4.
\begin{displaymath}
\begin{tabular}[b]{|l|l|l|}
\hline
$+$ & $35$ & $1\,1\,1\,\dots\,1\,1\,1$ \\
\hline
\end{tabular}
\end{displaymath}
\begin{displaymath}
-\quad
\begin{tabular}[b]{|l|l|l|}
\hline
$+$ & $35$ & $1\,1\,1\,\dots\,1\,1\,0$ \\
\hline
\end{tabular}
\end{displaymath}
\begin{displaymath}
=\quad
\begin{tabular}[b]{|l|l|l|}
\hline
$+$ & $35$ & $0\,0\,0\,\dots\,0\,0\,1$ \\
\hline
\end{tabular}
\end{displaymath}
\begin{displaymath}
=\quad
\begin{tabular}[b]{|l|l|l|}
\hline
$+$ & $14$ & $1\,0\,0\,\dots\,0\,0\,0$ \\
\hline
\end{tabular}
\end{displaymath}

Subtraction of two almost equal numbers

In the last (normalization) step the mantissa is arbitrarily filled up with zeros; the uncertainty of the result is $50\%$.

There is an everyday task in which such small differences may arise: solving the quadratic equation $ax^{2}+bx+c=0$. The usual formula

\begin{displaymath}
x_{1,2}=\frac{-b\pm\sqrt{b^{2}-4ac}}{2a}
\end{displaymath} (8.91)

will yield inaccurate results whenever $ac \ll b^{2}$. Since in writing a program one must always provide for the worst possible case, it is recommended to use the equivalent but less error-prone formula
\begin{displaymath}
x_{1}=\frac{q}{a}\, ,\hspace{60pt}x_{2}=\frac{c}{q}
\end{displaymath} (8.92)

with
\begin{displaymath}
q\equiv - \frac{1}{2}\left[ b+\mathrm{sgn}(b) \sqrt{b^{2}-4ac} \right]
\end{displaymath} (8.93)
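A sketch of formulas 8.92-8.93 in Python; the test case $x^{2}+10^{8}x+1=0$ (so that $ac \ll b^{2}$) is an assumption chosen here purely for illustration:

```python
import math

def quadratic_roots(a, b, c):
    """Roots of a x^2 + b x + c = 0 via the cancellation-free
    formulas x1 = q/a, x2 = c/q, with
    q = -(1/2) [ b + sgn(b) sqrt(b^2 - 4 a c) ]."""
    sgn = 1.0 if b >= 0.0 else -1.0
    q = -0.5 * (b + sgn * math.sqrt(b * b - 4.0 * a * c))
    return q / a, c / q

# ac << b^2: the small root (about -1e-8) would suffer severe
# cancellation in formula 8.91, but is obtained accurately here.
x1, x2 = quadratic_roots(1.0, 1.0e8, 1.0)
```

Adding $\mathrm{sgn}(b)$ before the square root guarantees that the two terms inside the bracket never nearly cancel; the second root then follows from the exact relation $x_{1}x_{2}=c/a$.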



EXERCISE: Assess the machine accuracy of your computer by trying various negative powers of $2$: add each to $1.0$, subtract $1.0$ again, and check whether the result is zero.
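This probe can be sketched in Python as follows; note that Python floats are IEEE 754 double precision, so the loop terminates at $2^{-52}$ rather than at the single-precision value discussed above:

```python
def machine_accuracy():
    """Return the smallest power of 2 that, added to 1.0,
    still yields a result different from 1.0."""
    eps = 1.0
    while (1.0 + eps / 2.0) != 1.0:
        eps /= 2.0
    return eps

print(machine_accuracy())  # 2**-52, i.e. about 2.22e-16, for IEEE 754 doubles
```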

Discrete Fourier Transformation

Fundamentals

We use the convention
\begin{displaymath}
\tilde{f}(\nu)=\int \limits_{-\infty}^{\infty}f(t)\, e^{2\pi i \nu t}\, dt\, ,
\hspace{40pt}
f(t)=\int \limits_{-\infty}^{\infty}\tilde{f}(\nu)\, e^{-2\pi i \nu t}\, d\nu
\end{displaymath} (8.94)

Assume that the function $f(t)$ is given only at discrete, equidistant values of its argument:
\begin{displaymath}
f_{k}\equiv f(t_{k})\equiv f(k\, \Delta t)\, , \;\;\; k=\dots, -2, -1, 0, 1, 2, \dots
\end{displaymath} (8.95)

The reciprocal of the time increment $\Delta t$ is called the sampling rate. The higher the sampling rate, the more details of the given function $f(t)$ are captured by the table of discrete values $f_{k}$. This intuitively evident fact is made quantitative by Nyquist's theorem: if the Fourier spectrum of $f(t)$,
\begin{displaymath}
\tilde{f}(\nu)\equiv \int \limits_{-\infty}^{\infty}f(t)
e^{\textstyle 2\pi i\nu t} dt
\end{displaymath} (8.96)

is negligible for frequencies beyond the critical (or Nyquist) frequency
\begin{displaymath}
\pm \nu_{0} \equiv \pm \frac{1}{2 \Delta t}
\end{displaymath} (8.97)

then $f(t)$ is called a band-limited process. Such a process is completely determined by its sampled values $f_{k}$. The formula that permits the reconstruction of $f(t)$ from the sampled data reads
\begin{displaymath}
f(t)=\sum_{k=-\infty}^{\infty}f_{k}
\frac{\sin[2\pi \nu_{0}(t-k\Delta t)]}{2\pi \nu_{0}(t-k\Delta t)}
\end{displaymath} (8.98)

(In contrast, if $f(t)$ is not band-limited, sampling with finite time resolution results in ``mirroring in'' the outlying parts of the spectrum from beyond $\pm \nu_{0}$, superposing them on the correct spectrum. In signal processing this effect is known as ``aliasing''.)
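A quick numerical illustration of aliasing (a sketch assuming $\Delta t=1$, hence $\nu_{0}=1/2$): a cosine of frequency $0.9$, well beyond $\nu_{0}$, produces exactly the same samples as its alias at $1-0.9=0.1$, since $\cos[2\pi(k-0.1k)]=\cos[2\pi\cdot 0.1 k]$ for integer $k$:

```python
import math

dt = 1.0  # sampling interval; Nyquist frequency nu_0 = 1/(2 dt) = 0.5
for k in range(32):
    above = math.cos(2.0 * math.pi * 0.9 * k * dt)  # nu = 0.9 > nu_0
    alias = math.cos(2.0 * math.pi * 0.1 * k * dt)  # mirrored-in nu = 0.1
    assert abs(above - alias) < 1e-9  # indistinguishable at the sample points
```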

Let us assume now that a finite set of sampled values is given:

\begin{displaymath}
f_{k} ,\;\;k=0,1, \dots N-1
\end{displaymath} (8.99)

and let $N$ be an even number. Define discrete frequencies by
\begin{displaymath}
\nu_{n}\equiv\frac{n}{N\Delta t} ,\;\;\;
n=-\frac{N}{2},\dots ,0, \dots ,\frac{N}{2}
\end{displaymath} (8.100)

(The $\nu_{n}$ pertaining to $n=N/2$ is again the Nyquist frequency.) Then the Fourier transform of $f(t)$ at some frequency $\nu_{n}$ is given by
\begin{displaymath}
\tilde{f}(\nu_{n}) \approx \Delta t \sum_{k=0}^{N-1}f_{k}\,
e^{\textstyle 2\pi i \nu_{n} t_{k}} =
\Delta t \sum_{k=0}^{N-1}f_{k}\, e^{\textstyle 2\pi i kn/N}
\end{displaymath} (8.101)

Thus it makes sense to define the discrete Fourier transform as

\fbox{\begin{minipage}{393pt}
\begin{displaymath}
F_{n} \equiv \sum_{k=0}^{N-1}f_{k}\, e^{\textstyle 2\pi i kn/N}
\end{displaymath}
\hspace{60pt}with $N$ even, and $n=0, \pm 1, \dots, \pm N/2$
\end{minipage}}

According to 8.101 the Fourier transform proper is just $\tilde{f}(\nu_{n}) \approx \Delta t   F_{n}$.

From the definition of $F_{n}$ it follows that $F_{-n}=F_{N-n}$. We make use of this periodicity to renumber the $F_{n}$ such that $n$ runs from $0$ to $N-1$ (instead of $-N/2$ to $N/2$):


          

\begin{displaymath}
\begin{array}{lccccccc}
n: & 0, & \dots & \frac{N}{2}-1, & \pm\frac{N}{2}, & -\frac{N}{2}+1, & \dots & -1 \\
\Longrightarrow & 0, & \dots & \frac{N}{2}-1, & \frac{N}{2}, & \frac{N}{2}+1, & \dots & N-1
\end{array}
\end{displaymath}




With this indexing convention the back transformation may be conveniently written
\begin{displaymath}
f_{k} =\frac{1}{N} \sum_{n=0}^{N-1}F_{n}e^{\textstyle -2\pi ikn/N}
\end{displaymath} (8.102)
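Written out directly, the pair looks as follows (an $O(N^{2})$ sketch in Python, using this text's sign convention: $e^{+2\pi ikn/N}$ forward, $e^{-2\pi ikn/N}$ back):

```python
import cmath

def dft(f):
    """F_n = sum_k f_k exp(+2 pi i k n / N),  n = 0 .. N-1."""
    N = len(f)
    return [sum(f[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N))
            for n in range(N)]

def idft(F):
    """f_k = (1/N) sum_n F_n exp(-2 pi i k n / N),  k = 0 .. N-1."""
    N = len(F)
    return [sum(F[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N)) / N
            for k in range(N)]

# Round trip: transforming back and forth recovers the table values.
f = [1.0, 2.0, 0.0, -1.0]
assert all(abs(x - y) < 1e-12 for x, y in zip(idft(dft(f)), f))
```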


Fast Fourier Transform (FFT)

If we were to use the above definition of $F_{n}$ ``as is'' to calculate the discrete Fourier transform, we would have to perform some $N^{2}$ operations. Cooley and Tukey (and before them Danielson and Lanczos; see [PRESS 86]) demonstrated how, by smart handling of the data, the number of operations may be pushed down to $\approx N\, \log_{2}N$. Note that for $N=1000$ this is an acceleration of $100:1$. Indeed, many algorithms of modern computational physics hinge on this possibility of rapidly transforming long tables of function values back and forth.

In the following it is always assumed that $N=2^{m}$. If $N$ is not a power of $2$, simply ``pad'' the table with $f_{k}=0$ up to the next power of two. Defining

\begin{displaymath}
W_{N}\equiv e^{\textstyle 2\pi i/N}
\end{displaymath} (8.103)

we realize that $W_{N}^{2}=W_{N/2}$ etc. The discrete Fourier transform is therefore
$\displaystyle F_{n}$ $\textstyle =$ $\displaystyle \sum_{k=0}^{N-1}W_{N}^{nk}f_{k}$ (8.104)
  $\textstyle =$ $\displaystyle \sum_{l=0}^{N/2-1}W_{N/2}^{nl}f_{2l}
+W_{N}^{n}\sum_{l=0}^{N/2-1}W_{N/2}^{nl}f_{2l+1}$ (8.105)
  $\textstyle \equiv$ $\displaystyle \quad F_{n}^{e}\quad+\quad W_{N}^{n} F_{n}^{o}$ (8.106)

where the indices $e$ and $o$ stand for ``even'' and ``odd''. Next we treat each of the two terms on the right-hand side of 8.106 in the same manner, finding
$\displaystyle F_{n}^{e}$ $\textstyle =$ $\displaystyle F_{n}^{ee}+W_{N/2}^{n}F_{n}^{eo}$ (8.107)
$\displaystyle F_{n}^{o}$ $\textstyle =$ $\displaystyle F_{n}^{oe}+W_{N/2}^{n}F_{n}^{oo}$ (8.108)

By iterating this procedure $m=\log_{2} N$ times we finally arrive at terms $F_{n}^{(...)}$ that are identical to the given table values $f_{k}$.

EXAMPLE: Putting $N=4$ we have $W_{4} \equiv \exp[2\pi i/4]$ and
$\displaystyle F_{n}$ $\textstyle =$ $\displaystyle \sum_{k=0}^{3}W_{4}^{nk}f_{k} \;\;n=0,\dots 3$ (8.109)
  $\textstyle =$ $\displaystyle \sum_{l=0}^{1}W_{2}^{nl}f_{2l}+W_{4}^{n}\sum_{l=0}^{1}W_{2}^{nl}f_{2l+1}$ (8.110)
  $\textstyle \equiv$ $\displaystyle \hspace{2em}F_{n}^{e}\;\;\;+\;\;\;W_{4}^{n}F_{n}^{o}$ (8.111)
  $\textstyle =$ $\displaystyle F_{n}^{ee}+W_{2}^{n}F_{n}^{eo}
+W_{4}^{n}\left[ F_{n}^{oe}+W_{2}^{n}F_{n}^{oo}\right]$ (8.112)
  $\textstyle =$ $\displaystyle f_{0}+W_{2}^{n}f_{2} +W_{4}^{n}
\left[f_{1}+W_{2}^{n}f_{3} \right]$ (8.113)

Thus the correspondence between the table values $f_{k}$ and the terms $F_{n}^{ee}$ etc. is as follows:

     

\begin{displaymath}
\begin{array}{lcccc}
 & ee & eo & oe & oo \\
\longrightarrow & 0 & 2 & 1 & 3
\end{array}
\end{displaymath}



EXERCISE: Demonstrate that a similar analysis as above leads for $N=8$ to the correspondences

         

\begin{displaymath}
\begin{array}{lcccccccc}
 & eee & eeo & eoe & eoo & oee & oeo & ooe & ooo \\
\longrightarrow & 0 & 4 & 2 & 6 & 1 & 5 & 3 & 7
\end{array}
\end{displaymath}

It is easy to see that this correspondence is reproduced by the following rule: 1) put $e\leftrightarrow0$ and $o\leftrightarrow 1$, such that $eeo \leftrightarrow 001$ etc.; 2) reverse the bit pattern thus obtained and interpret the result as an integer: $eeo \leftrightarrow 001 \leftrightarrow 100 = 4$. In other words, arrange the table values $f_{k}$ in bit-reversed order. (For example, $k=4$ goes to position $1$, since $4=100 \longrightarrow 001=1$.)
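The bit-reversal rule can be sketched as a small Python helper (the function name is hypothetical; $N=2^{m}$ is assumed):

```python
def bit_reverse(k, m):
    """Reverse the m-bit pattern of k, e.g. m = 3: 001 -> 100, i.e. 1 -> 4."""
    r = 0
    for _ in range(m):
        r = (r << 1) | (k & 1)  # shift the lowest bit of k into r
        k >>= 1
    return r

# For N = 8 (m = 3) this reproduces the ordering 0 4 2 6 1 5 3 7:
print([bit_reverse(k, 3) for k in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
```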

The correctly arranged $f_{k}$ are now combined in pairs according to 8.114. The rule to follow in performing this ``decimation'' step is sketched in Fig. 8.4.


         

\begin{displaymath}
\begin{array}{lcccccccc}
 & 0 & 4 & 2 & 6 & 1 & 5 & 3 & 7 \\
m=1 & a & b & a & b & a & b & a & b \\
m=2 & \multicolumn{2}{c}{a} & \multicolumn{2}{c}{b} & \multicolumn{2}{c}{a} & \multicolumn{2}{c}{b} \\
m=3 & \multicolumn{4}{c}{a} & \multicolumn{4}{c}{b}
\end{array}
\end{displaymath}

Decimation for $N=8$
On each level ($m$) the terms $a,b$ are combined according to
\begin{displaymath}
a+W_{2^{m}}^{n}b
\end{displaymath} (8.114)

It is evident that the number of operations is of order $N   \log_{2}N$.
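The even/odd splitting 8.104-8.106 translates directly into a short recursive routine (a sketch, assuming $N=2^{m}$ and using $W_{N}=e^{2\pi i/N}$ as in 8.103; for the upper half of the output one uses $W_{N}^{n+N/2}=-W_{N}^{n}$):

```python
import cmath

def fft(f):
    """Recursive Cooley-Tukey evaluation of F_n = sum_k W_N^{nk} f_k,
    for N a power of two."""
    N = len(f)
    if N == 1:
        return list(f)
    Fe = fft(f[0::2])  # transform of the even-indexed table values
    Fo = fft(f[1::2])  # transform of the odd-indexed table values
    W = [cmath.exp(2j * cmath.pi * n / N) for n in range(N // 2)]
    return ([Fe[n] + W[n] * Fo[n] for n in range(N // 2)] +   # n = 0 .. N/2-1
            [Fe[n] - W[n] * Fo[n] for n in range(N // 2)])    # n = N/2 .. N-1
```

The recursion depth is $m=\log_{2}N$, with $O(N)$ work per level; a production version would instead use bit-reversed ordering and combine the values in place, as described above.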

Further details of the method, plus sample programs in Pascal, Fortran, or C are given in [PRESS 86].

EXERCISE: Sketch the pattern of Figure 8.4 for $N=4$ and perform the ``decimation''. Compare your result to equ. 8.114.


Franz J. Vesely Oct 2005
See also:
"Computational Physics - An Introduction," Kluwer-Plenum 2001