Performance of CCFFT on the T3D

Chris Yerkes ( yerkes@arsc.edu ) of the UAF Electrical Engineering Department is using the ARSC T3D to implement an application that needs a two-dimensional FFT. As a first step towards implementing his application, he timed the T3D library routine CCFFT, which performs a complex-to-complex FFT. His timing program is a good example of a C program calling a library routine that is usually called from a Fortran program. Here's his code:

```
/* Test FFT program */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <fortran.h>
#define MAXLEN 32768  /* maximum length of FFT */
#define MAXFFTS   32  /* number of FFTs to do  */
#define MAXP2     15  /* log2( MAXLEN )        */

float fortran rtc();
void fortran ccfft();

static double Rdata[2*MAXLEN*MAXFFTS];
static double Work[2*MAXLEN];
static double re_table[2*MAXLEN],re_work[4*MAXLEN];

void main()
{
  int i,j,k,l,N,zero,one;
  float t1,t2;
  double et,t11[MAXP2],t21[MAXP2];
  double rone;
  double *Rd,*Wk;

  Rd = &Rdata[0];
  Wk = &Work[0];

  N=2;
  et = 0;
  zero=0;
  one=1;
  rone=1.0;

  for (j = 1; j < MAXP2 ; j++){
    N *= 2;
    ccfft(&zero,&N,&rone,&Work,&Work,&re_table,&re_work,&zero);
    l = 0;
    /* Initialize random vector */
    for (k = 0; k < 2*MAXFFTS*N;k++) (*(Rd+k)) = rand();
    /* Treat as 2d array of vectors of stride MAXFFTS*/
    for (k = 0; k < MAXFFTS ;k++){
      l = 0;
      for (i=0;i < N;i++){
        (*(Wk+(2*i)  )) = (*(Rd+(2*k)  +l));
        (*(Wk+(2*i+1))) = (*(Rd+(2*k+1)+l));
        l += MAXFFTS;
      }
      t1 = rtc();
      ccfft(&one,&N,&rone,&Work,&Work,&re_table,&re_work,&zero);
      t2 = rtc();
      et += (t2-t1)/150000000.0; /*Elapsed time using 150MHz*/
      l = 0;
      for (i=0;i < N;i++){
        (*(Rd+(2*k)  +l)) = (*(Wk+(2*i)  ));
        (*(Rd+(2*k+1)+l)) = (*(Wk+(2*i+1)));
        l += MAXFFTS;
      }
    }
    t11[j-1]=(1.0/(double)MAXFFTS)*et;  /* Averaged elapsed time */
    t21[j-1]=1.0/(1000000.*t11[j-1])*5*N*(j+1); /*MFLOPS from ccfft man page*/
    printf(" %4d %10d %e %6.1f\n", j, N, t11[j-1],t21[j-1]);
  }

}
```
On the 150 MFLOPs peak Alpha processor of the T3D, the performance was about 10% of peak, which was a disappointment. The T3D processor has a direct mapped 8KB cache and shares 2 pages of memory, so there can be significant degradation in performance when the program is missing cache lines or is swapping between pages. This is especially true with the power-of-two FFT, where it is very likely that consecutive loads map to the same cache line.

To check whether cache misses or page swapping were partly responsible, I added one line to the above program:

```
```
and allowed the value of 'change' to range from 1 to 12. Here is a table of MFLOPs for different values of change:
```
Performance (MFLOPs) of the T3D routine CCFFT
for Chris Yerkes' timing program

transform                 value of the pad (i.e., "change")
length      ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
               0    1    2    3    4    5    6    7    8    9   10   11   12
---------   ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
        4    7.3  7.3  7.3  7.5  7.5  7.6  7.6  7.3  7.2  7.2  7.2  7.5  7.5
        8    8.7  8.8  8.6  8.9  8.8  9.1  9.1  8.7  8.6  8.7  8.7  8.8  8.8
       16   10.9 11.8 11.7 11.3 11.3 11.6 11.6 10.9 10.8 11.7 11.8 11.2 11.2
       32   13.2 14.3 14.2 13.3 13.3 13.3 14.3 12.6 12.6 14.2 14.2 13.3 13.3
       64   15.7 17.1 16.7 15.7 15.7 16.6 17.2 15.3 15.4 17.1 17.1 15.7 15.7
      128   17.6 19.1 18.8 17.4 17.2 19.2 19.5 17.6 17.6 19.0 19.1 17.2 17.4
      256   19.6 21.3 21.3 19.0 18.9 21.5 21.7 19.6 19.5 21.4 21.3 19.0 19.0
      512   18.9 19.6 19.6 18.5 18.5 19.8 19.8 18.9 18.9 19.6 19.6 18.5 18.5
     1024   14.3 16.4 16.4 14.0 14.0 16.6 16.5 14.3 14.3 16.4 16.4 14.0 14.0
     2048   11.4 14.3 14.3 11.2 11.2 14.4 14.3 11.4 11.4 14.3 14.3 11.2 11.2
     4096   10.1 12.9 12.9 10.0 10.0 13.0 13.0 10.1 10.1 12.9 12.9 10.0 10.0
     8192    8.8 11.0 11.0  8.7  8.7 11.1 11.1  8.8  8.8 11.0 11.0  8.7  8.6
    16384    8.4 10.6 10.6  8.3  8.3 10.7 10.7  8.4  8.4 10.6 10.6  8.3  8.3
    32768    8.1 10.1 10.1  8.0  8.0 10.2 10.2  8.1  8.1 10.1 10.1  8.0  8.0
```
From the table we have several observations:
1. Eventually the transform is so large that performance decreases; this must be a loss of cache locality. (It rarely happens in the Y-MP vector world that a larger problem is less efficient than a smaller one.)
2. The pad is a multiple of 32 bits but from the performance it looks like the allocation is 64 bits at a time.
3. A simple 1 word (64 bits) pad gets about a 25% performance boost (8.1 to 10.1 MFLOPs) for the largest transform (32768) for this timing program.
That such a small code change can cause such a large performance difference is an example of why optimization is an "unstable" process. Fiddling with pads between the other arrays had little or no effect on the speed of CCFFT. If anyone knows of other ways to speed up these FFTs, I'd appreciate hearing from you.

FFT Operation Counts

In the common man page for FFTs on Denali, the operation count of a power-of-two FFT is approximated as:
```
5 * n * log2( n )
```
where:
1. each addition and multiplication is one operation
2. n is the length of the complex to complex FFT
3. log2( n ) is the base-2 logarithm of n
This was the operation count used in the article above. I used the Hardware Performance Monitor (hpm) on the Y-MP to find out just how close this estimate is to the actual number of operations. (Getting this number of operations is impossible on the T3D with Apprentice because Apprentice instruments user code and not library routines.) Running individual transforms and subtracting out the setup and initialization of tables and arrays, I've come up with the following table for the Y-MP routine ccfft:
```
Operation counts for the CCFFT routine on the Y-MP

Length of  Log2 of     Estimate of   Actual     Actual    Total Actual
Transform  the Length  operations   Additions  Multiplies   Operations
    n        log2(n)   5*n*log2(n)
---------  ----------  -----------  ---------  ----------  -----------
        4           2           40         31           7           38
        8           3          120         81          25          106
       16           4          320        173          49          222
       32           5          800        471         232          703
       64           6         1920       1047         488         1535
      128           7         4480       2379        1086         3465
      256           8        10240       5275        2270         7545
      512           9        23040      12012        5480        17492
     1024          10        51200      26476       11880        38356
     2048          11       112640      58781       27314        86095
     4096          12       245760     127453       58162       185615
     8192          13       532480     279150      132156       411306
    16384          14      1146880     598894      280124       879018
    32768          15      2457600    1295743      625222      1920965
    65536          16      5242880    2754124     1313795      4067919
```
There are several reasons why the approximation, 5*n*log2(n), is too generous:
1. The algorithm implemented is actually a mixed radix-2, radix-4, and radix-8 algorithm, not just a radix-2 algorithm.
2. The innermost loops of the implemented algorithm are probably interchanged to ensure a 'good' vector length.
3. Trivial operations like multiplies by 1.0 and additions with 0.0 have been optimized out.
Most FFTs are optimized for short execution time, and MFLOPs statistics (like those in the article above) are only attempts to relate the FFT's performance to other algorithms or to other machines. For the comparisons above, the results are valid as long as a consistent operation count is used. But comparing the T3D FFT performance to other algorithms or to other machines requires a more accurate operation count. Maybe it would be better for these FFTs to just stick with execution time and not attempt to present MFLOPs statistics.

Next week we'll have more results on the way to a two-dimensional FFT. There are two- and three-dimensional FFTs scheduled for the 1.2.1 release of the Programming Environment. That release isn't available yet but we may be able to move to it in July.

New T3D Batch PE Limits

In the past week, all active users of the ARSC T3D had their batch PE limit increased to 128. This allows these users access to the 128-PE 8-hour queues that run on the weekends. If you need your T3D UDB limits changed, please contact Mike Ess.

New Fortran Compiler

An upgraded version of the cf77 compiler is available on Denali at the paths:
```
/mpp/bin/cft77new and /mpp/bin/cf77new
```
For the default versions we have:
```
/mpp/bin/cf77 -V
Cray CF77_M   Version 6.0.4.1 (6.59)   05/25/95 13:36:39
Cray GPP_M    Version 6.0.4.1 (6.16)   05/25/95 13:36:39
Cray CFT77_M  Version 6.2.0.4 (227918) 05/25/95 13:36:39
```
and for this new version:
```
/mpp/bin/cf77new -V
Cray CF77_M   Version 6.0.4.1 (6.59)   05/25/95 13:37:26
Cray GPP_M    Version 6.0.4.1 (6.16)   05/25/95 13:37:26
Cray CFT77_M  Version 6.2.0.9 (259228) 05/25/95 13:37:27
```
This new compiler fixes a potential race condition in shared memory accesses and also fixes an inlining problem with the F90 intrinsics, MINLOC and MAXLOC.

This compiler will become the default after we finish testing it and users will be notified before that happens. I encourage users to try this compiler before it becomes the default.

List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
1. Data type sizes are not the same (Newsletter #5)
2. Uninitialized variables are different (Newsletter #6)
3. The effect of the -a static compiler switch (Newsletter #7)
4. There is no GETENV on the T3D (Newsletter #8)
5. Missing routine SMACH on T3D (Newsletter #9)