D. J. Bernstein
Fast arithmetic
djbfft

Complex convolution library interface

djbfft currently supports sizes 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, and 8192. This document focuses on 256 for simplicity.

Single-precision transforms

     #include <fftc4.h>

     complex4 a[256];

     fftc4_256(a);
     fftc4_scale256(a);
     fftc4_un256(a);
fftc4_256 computes a 256-point complex discrete Fourier transform. It evaluates the polynomial
     a[0] + a[1] x + a[2] x^2 + ... + a[255] x^255
at all the 256th roots of 1, and puts the values into a, overwriting the input. (Beware that the results are stored in an unusual order.) Each a[n] is a complex number with 4-byte real part a[n].re and 4-byte imaginary part a[n].im.

To compute the inverse transform, reconstructing a polynomial from its values, call fftc4_scale256 and then fftc4_un256. (fftc4_scale256 multiplies each a[n] by 1/256.)

Note that the position of a in memory can affect performance.

Single-precision convolutions

     #include <fftc4.h>

     complex4 a[256];
     complex4 b[256];

     fftc4_mul256(a,b);
fftc4_mul256 multiplies each a[n] by b[n] and puts the result into a[n].

The sequence of operations

     fftc4_256(a);
     fftc4_256(b);
     fftc4_mul256(a,b);
     fftc4_scale256(a);
     fftc4_un256(a);
convolves a with b: it multiplies the polynomial
     a[0] + a[1] x + a[2] x^2 + ... + a[255] x^255
by
     b[0] + b[1] x + b[2] x^2 + ... + b[255] x^255
modulo x^256-1 and puts the result back into a. The sequence of operations
     fftc4_256(b);
     fftc4_scale256(b);
     fftc4_256(a);
     fftc4_mul256(a,b);
     fftc4_un256(a);
has the same effect. If you have many polynomials to multiply by the same b, you can save time by reusing the transformed (and scaled) b.

Double-precision transforms and convolutions

     #include <fftc8.h>

     complex8 a[256];
     complex8 b[256];

     fftc8_256(a);
     fftc8_scale256(a);
     fftc8_un256(a);

     fftc8_mul256(a,b);
The fftc8 functions are just like the fftc4 functions except that they work with 8-byte floating-point numbers instead of 4-byte floating-point numbers.

WARNING: Some compilers, notably gcc without the -malign-double option, do not guarantee 8-byte alignment for 8-byte floating-point variables. The Pentium, Pentium II, et al. will slow down dramatically if your arrays are not aligned to 8-byte boundaries.

Frequencies

     #include <fftfreq.h>

     unsigned int n;
     unsigned int f;

     f = fftfreq_c(n,256);
What fftc4_256 and fftc8_256 put into a[n] is the value of the input polynomial at exp(2 pi if/256) where f = fftfreq_c(n,256).