ARSC T3D Users' Newsletter 41, June 23, 1995
The 2D FFT on the ARSC T3D
Srikanth Thirumalai of the Mathematical Software Group at Cray Research sends in the following note about CRI library routines for the two dimensional Fast Fourier Transforms on the T3D:
> This is just to inform you that Craylibs 1.2.1 has 2
> routines PCCFFT2D and PCCFFT3D for 2D and 3D parallel
> complex-to-complex FFTs. I have included the performance
> numbers for PCCFFT2D for your convenience.
>
> Craylibs 1.2.2 will have similar routines for
> real-to-complex and complex-to-real FFTs. They have been
> called PSC(CS)FFT2D(3D).
>
> Performance of PCCFFT2D in Mflops.
>
> Side of
> square                     <number of PEs>
> array
>             1      2      4      8     16     32      64     128
>
>    32    30.5   39.7   60.0   75.9   90.4   90.6    89.1    88.3
>    64    41.8   59.9  106.1  163.9  237.5  302.4   278.7   278.2
>   128    42.0   66.7  127.6  226.4  407.6  652.3   834.3   776.9
>   256    42.4   71.3  138.3  259.9  483.1  870.4  1433.4  1989.6
>   512    33.2   59.6  116.3  227.2  432.8  808.5  1438.7  2613.9

These numbers are better than those published in the last newsletter. ARSC is moving up to the 1.2.1 release of Craylibs, and ARSC users will be informed through this newsletter when ARSC upgrades.
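One way to read the table is in terms of parallel speedup and efficiency. The sketch below is not part of the Craylibs note; the function names speedup and efficiency are illustrative only. It assumes that for a fixed array size the operation count is the same on any number of PEs, so the ratio of Mflops rates equals the ratio of run times.

```c
/* Speedup on several PEs relative to one PE, computed from the Mflops
   rates in the table (same problem size in both runs). */
double speedup(double mflops_1pe, double mflops_npes)
{
    return mflops_npes / mflops_1pe;
}

/* Parallel efficiency: speedup divided by the number of PEs used.
   A value of 1.0 would mean perfect scaling. */
double efficiency(double mflops_1pe, double mflops_npes, int npes)
{
    return speedup(mflops_1pe, mflops_npes) / npes;
}
```

For the 512 x 512 case, 33.2 Mflops on one PE and 2613.9 Mflops on 128 PEs give a speedup of about 78.7 and an efficiency a little over 0.61. For the 32 x 32 case the rate stops improving beyond 16 PEs, so the efficiency on 128 PEs is very low.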
Small Changes to the ARSC T3D Batch Queues
On June 16th we made some small adjustments to the ARSC T3D batch queues. The T3D queues are now described on Denali with the command:

    news t3dbatch

which produces:

The description of the ARSC T3D NQS queues is:

New T3D Batch Queues
====================

The T3D batch queues were changed on June 16th 1995. The current T3D queues are:

Always on:

  m_8pe_24h    2 jobs using at most   8 PEs for 24 hours
  m_16pe_24h   2 jobs using at most  16 PEs for 24 hours
  m_32pe_24h   1 job  using at most  32 PEs for 24 hours
  m_64pe_10m   1 job  using at most  64 PEs for 10 minutes
  m_64pe_24h   1 job  using at most  64 PEs for 24 hours
  m_128pe_5m   1 job  using at most 128 PEs for 5 minutes

There is one additional queue that is enabled on Friday at 6PM and disabled at 4AM on Sunday:

  m_128pe_8h   1 job  using at most 128 PEs for 8 hours

A request made to these queues will be run as soon as enough PEs are available to satisfy the request.

User's UDBSEE limits

Most T3D users currently have a limit of 128 PEs for batch access. Users can check their limits with the udbsee command:

  udbsee | grep jpelimit

The output will indicate their limits in interactive (i) and batch (b). For example:

  jpelimit[b] :128:
  jpelimit[i] :8:

If your batch PE limit is too small to access these new NQS queues and you would like to use them, please contact Mike Ess, either by phone at 907-474-5404 or email to ess@arsc.edu, to have your PE batch limits increased.

Users can query the NQS batch system with the command:

  qstat -a

to see what other NQS T3D jobs are scheduled to run on the T3D. The utility mppmon is available to see what jobs are currently running on the T3D. T3D jobs are executed on a "first fit" priority and run to completion without interruption.

Mike Ess, June 19, 1995
Big Jobs in the ARSC T3D NQS Queues
At ARSC it is possible to use all 128 PEs, either through the two 128 PE batch queues or, for some users, during interactive sessions. Before submitting a 128 PE job, it is a good idea to use the mppmon command to see who is using the machine. The two ideas to keep in mind are:

  1. Your job won't run until the machine becomes idle. Even a single PE job will make your 128 PE job wait until the machine is idle.

  2. All jobs submitted after your 128 PE job will wait until your 128 PE job completes.

The combination of a long running job (any number of PEs) followed by a 128 PE job (for any amount of time) will effectively close down the T3D until both the long running job AND the 128 PE job finish.
The EPCC/CRI Version of MPI on the ARSC T3D
The three files necessary for running MPI programs on the ARSC T3D have been put in their default locations:

    mpi.h and mpif.h go to /usr/include/mpp
    libmpi.a goes to /mpp/lib

Now (when the environment variable TARGET is set to cray-t3d) MPI C programs can be compiled as:

    cc -c shifter.c
    cc shifter.o -lmpi

and similarly for MPI Fortran programs:

    cf77 -c -I/usr/include/mpp shifterF.f
    cf77 shifterF.o -lmpi

If there are any problems with MPI on the ARSC T3D, please contact Mike Ess.
Sorting on the T3D
A user asked:

> What type of sorting functions are available on the T3D?
> On the YMP, there are several sorting functions: ISORTD,
> SSORTB, ISORTB and ORDERS. But looking over the decks that
> are in the MPP Craylibs, I could only find qsort in
> /mpp/lib/libc.a. qsort requires the user to supply a
> function that compares the elements of the array to be
> sorted, so I doubt that it can be very fast sorting on a
> specific instance, like an array of integers.

There can be a big speed difference between sorting functions. To start an investigation, I typed in several functions from "Numerical Recipes in C", by Press, Flannery, Teukolsky & Vetterling, and timed them on a single PE on the T3D. The functions in this book are translations from the book "Numerical Recipes in Fortran" and assume that the elements to be sorted are in the array a[1:n]. I changed the functions to sort the array when the elements are in a[0:n-1], which seems more natural in C. Each function takes as input the length of an array of integers and the starting address of the array, and overwrites the input array with the sorted array.
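For reference, the qsort interface mentioned in the question can be used on an array of integers roughly as follows. This is a minimal sketch in ANSI C using only the standard library; the names compare_ints and sort_ints are made up for this example and are not Craylibs routines. The call through a comparison function for every pair of elements is the overhead the user is worried about.

```c
#include <stdlib.h>

/* Comparison function in the form qsort expects: returns a negative,
   zero, or positive value as *pa is less than, equal to, or greater
   than *pb. */
static int compare_ints(const void *pa, const void *pb)
{
    int a = *(const int *) pa;
    int b = *(const int *) pb;
    return (a > b) - (a < b);   /* avoids the overflow a - b could cause */
}

/* Sort arr[0:n-1] in ascending order with the C library qsort. */
void sort_ints(int *arr, size_t n)
{
    qsort(arr, n, sizeof arr[0], compare_ints);
}
```

Calling sort_ints(v, 5) on an array v holding 3, -1, 4, 1, 0 leaves v holding -1, 0, 1, 3, 4.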
The timing of sorting functions is very sensitive to how "orderly" the input array is, but for input arrays generated with the random number generator RANF (newsletter #37) the sorting times below are typical:
A timing comparison of sorting functions on the T3D (seconds)

  Number of     insertion     shell      quicksort   Munstock's
  elements        sort        sort         sort        sort
  to sort

        1       0.000004    0.000010    0.000005    0.000004
        2       0.000005    0.000012    0.000006    0.000008
        3       0.000006    0.000013    0.000006    0.000008
        4       0.000006    0.000017    0.000007    0.000011
        5       0.000007    0.000018    0.000008    0.000013
       10       0.000009    0.000024    0.000012    0.000021
       20       0.000017    0.000043    0.000022    0.000045
       30       0.000026    0.000072    0.000032    0.000079
       40       0.000041    0.000099    0.000044    0.000123
       50       0.000055    0.000133    0.000056    0.000160
      100       0.000194    0.000327    0.000122    0.000364
      200       0.000645    0.000846    0.000270    0.000873
      300       0.001543    0.001358    0.000423    0.001687
      400       0.002762    0.001795    0.000595    0.002505
      500       0.004387    0.002507    0.000759    0.003231
     1000       0.016640    0.005644    0.001750    0.008732
     2000       0.075755    0.014613    0.004072    0.026065
     3000       0.194306    0.024549    0.007052    0.047113
     4000       0.389964    0.034641    0.010076    0.070686
     5000       0.611296    0.045975    0.013453    0.095750
    10000       2.698027    0.106615    0.032286    0.224263
    20000      11.086379    0.242284    0.077115    0.535524
    30000      25.146869    0.382456    0.126676    0.892756
    40000      44.656612    0.560791    0.180124    1.307390
    50000      69.894340    0.719300    0.235710    1.781510

The table shows that making the right algorithm choice can make a big difference. Below is the source code for the timer/tester and the source for each of the sorting functions. If any user has more results in this area I would be happy to put them in this newsletter.
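The sensitivity to input order can be made concrete by counting the work insertion sort does. The sketch below is an illustration written for this purpose, not one of the timed functions; the name insert_count is hypothetical. It is the same straight insertion algorithm, instrumented to return the number of element moves made in the inner loop.

```c
/* Straight insertion sort of arr[0:n-1] in ascending order.
   Returns the number of element moves made in the inner loop,
   as a rough measure of the work done. */
long insert_count(int n, int arr[])
{
    long moves = 0;
    int i, j, a;

    for (j = 1; j < n; j++) {            /* pick out each element */
        a = arr[j];
        i = j - 1;
        while (i >= 0 && arr[i] > a) {   /* shift larger elements up */
            arr[i + 1] = arr[i];
            i--;
            moves++;
        }
        arr[i + 1] = a;                  /* insert it */
    }
    return moves;
}
```

An array that is already sorted costs no moves at all, while an array in reverse order costs n(n-1)/2 moves (10 moves for n = 5). Random input falls in between but still grows quadratically, which matches the insertion sort column above: going from 10000 to 20000 elements raises the time from about 2.7 to about 11.1 seconds, roughly a factor of four.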
Often in sorting, it is not desirable to have the sorted array overwrite the unsorted input array. It is more useful to have an index array whose i-th value is the position, in the original array, of the element that comes i-th in sorted order. With this index, the sorted array could be printed out as:

    for( i = 0; i < n; i++ ) printf( " %d %d\n", i, a[ index[ i ] ] );

The "Numerical Recipes in C" book shows how to modify the insertion, shell and heap sort functions to return this index that sorts the input array. The function "munstock" is this kind of sorting function: it produces an index to sort the array. (I got this function from Jim Munstock of CRI more than ten years ago in CDC assembler. The translation to C is straightforward and it's a useful utility to keep around.) The source for the timer/tester and the sorting functions follows:
#include <stdio.h>    /* printf */
#include <stdlib.h>   /* exit */

#define MAXLENGTH 100000
#define MAXCASE 31

main()
{
  int a[ MAXLENGTH ], b[ MAXLENGTH ], c[ MAXLENGTH ], d[ MAXLENGTH ];
  int e[ MAXLENGTH ], indx[ MAXLENGTH ];
  int i, j, n;
  fortran double RANF();
  void insert(), heap(), shell(), munstock();
  double t1, t2, t3, t4, t5, second();
  int ncase,kcase[MAXCASE] = {0,1,2,3,4,5,10,20,30,40,50,100,200,300,400,500,
                              1000,2000,3000,4000,5000,10000,20000,30000,
                              40000,50000,100000,200000,300000,400000,500000};

  for( ncase = 0; ncase < 27; ncase++ ) {
    n = kcase[ ncase ];
    /* every sort gets its own copy of the same random input */
    for( i = 0; i < n; i++ ) {
      a[ i ] = n * RANF();
      b[ i ] = a[ i ];
      c[ i ] = a[ i ];
      d[ i ] = a[ i ];
      e[ i ] = a[ i ];
    }
    t1 = second( );
    insert( n, b );
    t2 = second( );
    shell( n, c );
    t3 = second( );
    heap( n, d );
    t4 = second( );
    munstock( n, e, indx );
    t5 = second( );
    if( n > 0 ) {
      /* check that the insertion sort result is in ascending order */
      for( j = 0; j < n - 1; j++ ) {
        if( b[ j ] > b[ j+1 ] ) {
          printf( " failure at insert sort %d %d %d\n", j, a[ j ], b[ j ] );
          for( i = 0; i < n; i++ ) printf( " %d %d %d\n", i, a[ i ], b[ i ] );
          exit( 1 );
        }
      }
      /* check the other sorts against the insertion sort result */
      for( j = 0; j < n; j++ ) {
        if( c[ j ] != b[ j ] ) {
          printf( " failure at shell sort %d %d %d\n", j, b[ j ], c[ j ] );
          for( i = 0; i < n; i++ ) printf( " %d %d %d\n", i, a[ i ], c[ i ] );
          exit( 1 );
        }
      }
      for( j = 0; j < n; j++ ) {
        if( d[ j ] != b[ j ] ) {
          printf( " failure at heap sort %d %d %d\n", j, b[ j ], d[ j ] );
          for( i = 0; i < n; i++ ) printf( " %d %d %d\n", i, a[ i ], d[ i ] );
          exit( 1 );
        }
      }
      for( j = 0; j < n; j++ ) {
        if( e[ indx[ j ] ] != b[ j ] ) {
          printf( " failure at index sort %d %d %d\n",j,indx[j],e[indx[j]]);
          for( i = 0; i < n; i++ ) printf( " %d %d %d\n", i, b[i],e[indx[i]]);
          exit( 1 );
        }
      }
    }
    printf( " %2d %6d %12.6f %10.6f %10.6f %10.6f\n",
            ncase,n,t2-t1,t3-t2,t4-t3,t5-t4);
  }
}

void insert( n, arr )
/*
  sorts the array arr[0:n-1] in ascending order with insertion sort

  insertion sorting from "Numerical Recipes in C",
  Press,Flannery,Teukolsky & Vetterling,
  modified for "C" arrays by Mike Ess, 1995
*/
int n;
int arr[];
{
  int i, j, a;

  if( n < 1 ) {
    printf( "insert sort called with elements to sort less than 1\n" );
  } else {
    for( j = 1; j < n; j++ ) {             /* pick out each element */
      a = arr[ j ];
      i = j - 1;
      while( i >= 0 && arr[ i ] > a ) {    /* look for place to insert it */
        arr[ i + 1 ] = arr[ i ];
        i--;
      }
      arr[ i + 1 ] = a;                    /* insert it */
    }
  }
}

#include <math.h>
#define ALN2I 1.442695022
#define TINY 1.0e-5

void shell( n, arr )
/*
  sorts the array arr[0:n-1] in ascending order with shell sort

  shell sorting from "Numerical Recipes in C",
  Press,Flannery,Teukolsky & Vetterling,
  modified for "C" arrays by Mike Ess, 1995
*/
int n;
int arr[];
{
  int nn, m, j, i, lognb2;
  int t;

  if( n < 1 ) {
    printf( "shell sort called with elements to sort less than 1\n" );
  } else {
    lognb2 = (log((double) n) * ALN2I + TINY);
    m = n;
    for( nn = 1; nn <= lognb2; nn++ ) {    /* Loop over partial sorts */
      m >>= 1;
      for( j = m; j < n ; j++ ) {          /* Outer loop of straight insertion */
        i = j - m;
        t = arr[ j ];
        while( i >= 0 && arr[ i ] > t ) {  /* Inner loop of straight insertion */
          arr[ i + m ] = arr[ i ];
          i -= m;
        }
        arr[ i + m ] = t;
      }
    }
  }
}

void heap( n, ra )
/*
  sorts the array ra[0:n-1] in ascending order with heap sort

  heap sorting from "Numerical Recipes in C",
  Press,Flannery,Teukolsky & Vetterling,
  modified for "C" arrays by Mike Ess, 1995
*/
int n;
int ra[];
{
  int l, j, ir, i;
  int rra;

  if( n < 1 ) {
    printf( "heap sort called with elements to sort less than 1\n" );
  } else {
    l = n / 2 + 1;
    ir = n;
/*
  The index l will be decremented from its initial value down to 1 during
  the "hiring" (heap creation) phase. Once it reaches 1, the index ir will
  be decremented from its initial value down to 1 during the
  "retirement-and-promotion" (heap selection) phase
*/
    for( ;; ) {
      if( l > 1 ) {               /* Still in hiring phase */
        l--;
        rra = ra[ l - 1 ];
      } else {                    /* In retirement-and-promotion phase. */
        rra = ra[ ir - 1 ];       /* Clear a space at the end of array. */
        ra[ ir - 1 ] = ra[ 0 ];   /* Retire the top of the heap into it. */
        ir--;                     /* Done with the last promotion. */
        if( ir == 0 ) {           /* The least competent worker of all! */
          ra[ 1 - 1 ] = rra;
          return;
        }
      }
      i = l;                      /* Whether we are in the hiring phase   */
      j = l + l;                  /* or promotion phase, we here set up to */
      while( j <= ir ) {          /* shift down element rra to its proper  */
        if( j < ir ) {            /* level.                                */
          if( ra[ j - 1 ] < ra[ j ] ) j++;  /* Compare to the better underling */
        }
        if( rra < ra[ j - 1 ] ) { /* demote rra */
          ra[ i - 1 ] = ra[ j - 1 ];
          i = j;
          j = j + j;
        } else {
          j = ir + 1;             /* This is rra's level. Set j to */
        }                         /* terminate the sift-down       */
      }
      ra[ i - 1 ] = rra;          /* Put rra into its slot. */
    }
  }
}

void munstock( length, a, ind )
/*
  Sort the array "a" to produce an index "ind" of the sorted array.
  From Jim Munstock, translated to C by Mike Ess 1989
*/
int length;
int a[ ];
int ind[ ];
{
  int i, ii, ij, j, m, m1, n2;
  int t;

  for ( i = 0 ; i < length; i++ ) ind[ i ] = i;
  /* choose the starting gap m, one less than a power of two */
  m = 1;
  n2 = length / 2;
  m = 2 * m;
  while ( m <= n2 ) m = 2 * m;
  m = m - 1;
three:;
  m1 = m + 1;
  for ( j = m1 - 1 ; j < length; j++ ) {
    t = a[ ind[ j ] ];
    ij = ind[ j ];
    i = j - m;
    ii = ind[ i ];
four:;
    if ( t < a[ ii ] ) {
      ind[ i+m ] = ii;
      i = i - m;
      if ( i >= 0 ) {
        ii = ind[ i ];
        goto four;
      }
    }
    ind[ i+m ] = ij;
  }
  m = m / 2;
  if ( m > 0 ) goto three;
  return;
}
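To make the index idea described above concrete, here is a small sketch of an index sort built on straight insertion. It is not the munstock function and not from "Numerical Recipes"; the names index_insert and print_sorted are invented for this example. The input array is left untouched and only the index is rearranged.

```c
#include <stdio.h>

/* Fill ind[0:n-1] so that a[ind[0]] <= a[ind[1]] <= ... <= a[ind[n-1]].
   The array a itself is not modified. Uses straight insertion applied
   to the index rather than to the data. */
void index_insert(int n, const int a[], int ind[])
{
    int i, j, ij;

    for (i = 0; i < n; i++) ind[i] = i;   /* start with the identity */
    for (j = 1; j < n; j++) {
        ij = ind[j];                      /* index entry being placed */
        i = j - 1;
        while (i >= 0 && a[ind[i]] > a[ij]) {
            ind[i + 1] = ind[i];          /* shift index entries up */
            i--;
        }
        ind[i + 1] = ij;
    }
}

/* Print the elements of a in sorted order by going through the index. */
void print_sorted(int n, const int a[], const int ind[])
{
    int i;

    for (i = 0; i < n; i++)
        printf(" %d %d\n", i, a[ind[i]]);
}
```

For a holding 30, 10, 40, 20 the index comes back as 1, 3, 0, 2. One advantage of this form is that the same index can be used to reorder other arrays that are stored in parallel with a.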
List of Differences Between T3D and YMP
The current list of differences between the T3D and the YMP is:

   1. Data type sizes are not the same (Newsletter #5)
   2. Uninitialized variables are different (Newsletter #6)
   3. The effect of the -a static compiler switch (Newsletter #7)
   4. There is no GETENV on the T3D (Newsletter #8)
   5. Missing routine SMACH on T3D (Newsletter #9)
   6. Different Arithmetics (Newsletter #9)
   7. Different clock granularities for gettimeofday (Newsletter #11)
   8. Restrictions on record length for direct I/O files (Newsletter #19)
   9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
  10. Missing Linpack and Eispack routines in libsci (Newsletter #25)
  11. F90 manual for YMP, no manual for T3D (Newsletter #31)
  12. RANF() and its manpage differ between machines (Newsletter #37)
  13. CRAY2IEG is available only on the YMP (Newsletter #40)
  14. Missing sort routines on the T3D (Newsletter #41)
Current Editors:

  Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
  Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678

  Arctic Region Supercomputing Center
  University of Alaska Fairbanks
  PO Box 756020
  Fairbanks AK 99775-6020

Email Subscriptions:

  Subscribe to (or unsubscribe from) the email edition of the ARSC HPC Users' Newsletter.

Back issues of the ASCII email edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.