ARSC T3D Users' Newsletter 38, June 2, 1995
Performance of CCFFT on the T3D
Chris Yerkes ( yerkes@arsc.edu ) of the UAF Electrical Engineering Department is using the ARSC T3D to implement an application that needs a two-dimensional FFT. As a first step towards implementing his application, he timed the T3D library routine CCFFT, which performs a complex-to-complex FFT. His timing program is a good example of a C program calling a library routine that is usually called from a Fortran program. Here's his code:
    /* Test FFT program */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <fortran.h>

    #define MAXLEN  32768   /* maximum length of FFT */
    #define MAXFFTS 32      /* number of FFTs to do */
    #define MAXP2   15      /* log2( MAXLEN ) */

    float fortran rtc();
    void fortran ccfft();

    static double Rdata[2*MAXLEN*MAXFFTS];
    static float pad[change];
    static double Work[2*MAXLEN];
    static double re_table[2*MAXLEN],re_work[4*MAXLEN];

    void main()
    {
      int i,j,k,l,N,zero,one;
      float t1,t2;
      double et,t11[MAXP2],t21[MAXP2];
      double rone;
      double *Rd,*Wk;

      Rd = &Rdata[0];
      Wk = &Work[0];
      N=2;
      et = 0;
      zero=0;
      one=1;
      rone=1.0;
      for (j = 1; j < MAXP2 ; j++){
        N *= 2;
        ccfft(&zero,&N,&rone,&Work,&Work,&re_table,&re_work,&zero);
        l = 0;
        /* Initialize random vector */
        for (k = 0; k < 2*MAXFFTS*N;k++) (*(Rd+k)) = rand();
        /* Treat as 2d array of vectors of stride MAXFFTS*/
        for (k = 0; k < MAXFFTS ;k++){
          l = 0;
          for (i=0;i < N;i++){
            (*(Wk+(2*i)  )) = (*(Rd+(2*k)  +l));
            (*(Wk+(2*i+1))) = (*(Rd+(2*k+1)+l));
            l += MAXFFTS;
          }
          t1 = rtc();
          ccfft(&one,&N,&rone,&Work,&Work,&re_table,&re_work,&zero);
          t2 = rtc();
          et += (t2-t1)/150000000.0; /*Elapsed time using 150MHz*/
          l = 0;
          for (i=0;i < N;i++){
            (*(Rd+(2*k)  +l)) = (*(Wk+(2*i)  ));
            (*(Rd+(2*k+1)+l)) = (*(Wk+(2*i+1)));
            l += MAXFFTS;
          }
        }
        t11[j-1]=(1.0/(double)MAXFFTS)*et; /* Averaged elapsed time */
        t21[j-1]=1.0/(1000000.*t11[j-1])*5*N*(j+1); /*MFLOPS from ccfft man page*/
        printf(" %4d %10d %e %6.1f\n", j, N, t11[j-1],t21[j-1]);
      }
    }

On the 150 MFLOPs peak Alpha processor of the T3D, the performance was about 10% of peak, which was a disappointment. The T3D processor has a direct mapped 8KB cache and shares 2 pages of memory, so there can be significant degradation in performance when the program is missing cache lines or is swapping between pages. This is especially true with the power-of-two FFT, where it is very likely that consecutive loads map to the same cache line.
To check out the possibility that cache misses or page swapping were partly responsible, I added one line to the above program:

    static float pad[change];

and allowed the value of 'change' to range from 1 to 12. Here is a table of MFLOPs for different values of change:
Performance (MFLOPs) of the T3D routine CCFFT for Chris Yerkes' timing program

transform                    value of the pad (i.e., "change")
   length     0     1     2     3     4     5     6     7     8     9    10    11    12

        4   7.3   7.3   7.3   7.5   7.5   7.6   7.6   7.3   7.2   7.2   7.2   7.5   7.5
        8   8.7   8.8   8.6   8.9   8.8   9.1   9.1   8.7   8.6   8.7   8.7   8.8   8.8
       16  10.9  11.8  11.7  11.3  11.3  11.6  11.6  10.9  10.8  11.7  11.8  11.2  11.2
       32  13.2  14.3  14.2  13.3  13.3  13.3  14.3  12.6  12.6  14.2  14.2  13.3  13.3
       64  15.7  17.1  16.7  15.7  15.7  16.6  17.2  15.3  15.4  17.1  17.1  15.7  15.7
      128  17.6  19.1  18.8  17.4  17.2  19.2  19.5  17.6  17.6  19.0  19.1  17.2  17.4
      256  19.6  21.3  21.3  19.0  18.9  21.5  21.7  19.6  19.5  21.4  21.3  19.0  19.0
      512  18.9  19.6  19.6  18.5  18.5  19.8  19.8  18.9  18.9  19.6  19.6  18.5  18.5
     1024  14.3  16.4  16.4  14.0  14.0  16.6  16.5  14.3  14.3  16.4  16.4  14.0  14.0
     2048  11.4  14.3  14.3  11.2  11.2  14.4  14.3  11.4  11.4  14.3  14.3  11.2  11.2
     4096  10.1  12.9  12.9  10.0  10.0  13.0  13.0  10.1  10.1  12.9  12.9  10.0  10.0
     8192   8.8  11.0  11.0   8.7   8.7  11.1  11.1   8.8   8.8  11.0  11.0   8.7   8.6
    16384   8.4  10.6  10.6   8.3   8.3  10.7  10.7   8.4   8.4  10.6  10.6   8.3   8.3
    32768   8.1  10.1  10.1   8.0   8.0  10.2  10.2   8.1   8.1  10.1  10.1   8.0   8.0

From the table we have several observations:
 Eventually the transform is so large that performance decreases; this must be a loss of cache locality. (It rarely happens in the YMP vector world that a larger problem is less efficient than a smaller one.)
 The pad is a multiple of 32 bits but from the performance it looks like the allocation is 64 bits at a time.
 A simple 1 word (64 bits) pad gets about a 20% performance boost for the largest transform (32768) for this timing program.
FFT Operation Counts
In the common man page for FFTs on Denali, the operation count of a power-of-two FFT is approximated as:

    5 * n * log2( n )

where:
 each addition and multiplication is one operation
 n is the length of the complex to complex FFT
 log2( n ) is the base 2 logarithm of n
Operation counts for the CCFFT routine on the YMP

Length of    Log2 of  Estimate of     Actual      Actual  Total Actual
Transform the Length   operations  Additions  Multiplies    Operations
        n    log2(n)  5*n*log2(n)

        4          2           40         31           7            38
        8          3          120         81          25           106
       16          4          320        173          49           222
       32          5          800        471         232           703
       64          6         1920       1047         488          1535
      128          7         4480       2379        1086          3465
      256          8        10240       5275        2270          7545
      512          9        23040      12012        5480         17492
     1024         10        51200      26476       11880         38356
     2048         11       112640      58781       27314         86095
     4096         12       245760     127453       58162        185615
     8192         13       532480     279150      132156        411306
    16384         14      1146880     598894      280124        879018
    32768         15      2457600    1295743      625222       1920965
    65536         16      5242880    2754124     1313795       4067919

There are several reasons why the approximation, 5*n*log2(n), is too generous:
 The algorithm implemented is actually a mixed radix-2, radix-4, and radix-8 algorithm, not just a radix-2 algorithm.
 The innermost loops of the implemented algorithm are probably interchanged to ensure a 'good' vector length.
 Trivial operations like multiplies by 1.0 and additions with 0.0 have been optimized out.
Next week we'll have more results on the way to a two-dimensional FFT. There are two- and three-dimensional FFTs scheduled for the 1.2.1 release of the Programming Environment. That release isn't available yet but we may be able to move to it in July.
New T3D Batch PE Limits
In the past week all active users of the ARSC T3D had their batch PE limit increased to 128. This allows these users access to the 128-PE 8-hour queues that run on the weekends. If you need your T3D UDB limits changed please contact Mike Ess.
New Fortran Compiler
An upgraded version of the cf77 compiler is available on Denali with the paths:

    /mpp/bin/cft77new and /mpp/bin/cf77new

For the default versions we have:

    /mpp/bin/cf77 -V
    Cray CF77_M   Version 6.0.4.1 (6.59)    05/25/95 13:36:39
    Cray GPP_M    Version 6.0.4.1 (6.16)    05/25/95 13:36:39
    Cray CFT77_M  Version 6.2.0.4 (227918)  05/25/95 13:36:39

and for this new version:

    /mpp/bin/cf77new -V
    Cray CF77_M   Version 6.0.4.1 (6.59)    05/25/95 13:37:26
    Cray GPP_M    Version 6.0.4.1 (6.16)    05/25/95 13:37:26
    Cray CFT77_M  Version 6.2.0.9 (259228)  05/25/95 13:37:27

This new compiler fixes a potential race condition in shared memory accesses and also fixes an inlining problem with the F90 intrinsics, MINLOC and MAXLOC.
This compiler will become the default after we finish testing it and users will be notified before that happens. I encourage users to try this compiler before it becomes the default.
List of Differences Between T3D and YMP
The current list of differences between the T3D and the YMP is:
 Data type sizes are not the same (Newsletter #5)
 Uninitialized variables are different (Newsletter #6)
 The effect of the -a static compiler switch (Newsletter #7)
 There is no GETENV on the T3D (Newsletter #8)
 Missing routine SMACH on T3D (Newsletter #9)
 Different Arithmetics (Newsletter #9)
 Different clock granularities for gettimeofday (Newsletter #11)
 Restrictions on record length for direct I/O files (Newsletter #19)
 Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
 Missing Linpack and Eispack routines in libsci (Newsletter #25)
 F90 manual for YMP, no manual for T3D (Newsletter #31)
 RANF() and its manpage differ between machines (Newsletter #37)
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020

Email Subscriptions:
Subscribe to (or unsubscribe from) the email edition of the ARSC HPC Users' Newsletter.

Back issues of the ASCII email edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.