ARSC T3D Users' Newsletter 85, May 3, 1996
How Much does that Division Cost?
The DEC Alpha processor is an IEEE processor. It supports both the formats and the arithmetic operations defined by ANSI/IEEE Std 754-1985, the document produced by the Floating-Point Working Group of the Microprocessor Standards Subcommittee of the Standards Committee of the IEEE Computer Society. This document specifies what is colloquially called 'IEEE arithmetic'. I can U.S. mail copies of this document to whoever requests it.
The standard specifies that all operations be computed with error within 1/2 ulp (ulp = unit of least precision), provide four modes of rounding, implement the special quantities Infinity, NaN, and signed zero, raise the correct signals on specified exceptions, and allow users to install trap handlers for when an exception occurs. The standard treats the four arithmetic functions of addition, subtraction, multiplication, and division as fundamental functions, requiring the same precision from each.
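As an aside, the size of an ulp is easy to inspect from Fortran 90, whose numeric inquiry intrinsics expose these IEEE quantities directly. The following is just a small illustration of my own (not part of the standard or of any benchmark in this issue):

      program ulps
c     print the ulp of 1.0 in both floating point formats
      real*4 x4
      real*8 x8
      x4 = 1.0
      x8 = 1.0d0
c     spacing() returns the distance to the next representable
c     number, i.e. one ulp: 2**(-23) for 32 bits, 2**(-52) for 64 bits
      write( 6, * ) '32 bit ulp of 1.0 = ', spacing( x4 )
      write( 6, * ) '64 bit ulp of 1.0 = ', spacing( x8 )
c     nearest() steps to the adjacent representable number
      write( 6, * ) 'next number above 1.0 (64 bit) = ',
     &              nearest( x8, 1.0d0 )
      end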
From the point of view of the hardware engineer, however, the four functions differ considerably in circuitry costs and execution speeds. Below is a table I've accumulated for these arithmetic functions on various machines:
<execution speeds>

  machine             addition     multiplication   division
  Univac 1108         2.2 us       3.0 us           8.63 us
  IBM 360/85          .38 us       1.36 us          1.64 us
  CDC 6600            .4 us        1. us            2.9 us
  VAX 11/750 (in C)   10.6 us      15.7 us          15.7 us
  Burroughs BSP       320 ns       320 ns           1280 ns
  Cray-1              75 ns        87.5 ns          362.5 ns
  CDC 205             5 cps        5 cps            54 cps
  Cray X-MP           6 cps        7 cps            29 cps
  Intel 387           29-37 cps    32-57 cps        94 cps
  Cray T3D            6 cps        6 cps            61 cps

  us = microseconds, ns = nanoseconds, cps = clock periods

Certainly as implemented, each operation takes a different amount of time, and underneath, they are implemented with varying amounts of silicon. Of course, it's the last bits of each result that are hardest to get correct, and there were significant speed vs. precision tradeoffs before the IEEE standard preempted them. For example, for the Cray-1, with only 48 bits in the mantissa (52 for IEEE double precision), we had:
  - addition is chopped
  - multiplication is "statistically" rounded
  - division is a reciprocal followed by a multiply (could be off by 2 ulps)

But there are still tradeoffs to be made in the software for the most expensive operation, division. There are always several options:
  - use 32 or 64 bits (except on the Cray vector machines)
  - use optimized library functions
  - do a non-IEEE divide

A non-IEEE divide replaces:

      a / b    with    a * ( 1. / b )

either in the hardware or by the compiler. The reciprocal-multiply can cost accuracy, but there are times when the denominator is constant within a loop, and then such a transformation seems reasonable:
      do 10 i = 1, large_number
         a( i ) = b( i ) / scale
 10   continue

could be replaced by:

      qscale = 1.0 / scale
      do 10 i = 1, large_number
         a( i ) = b( i ) * qscale
 10   continue

Here we have a large_number of divisions being replaced by 1 division and a large_number of multiplications. The table below presents timings on a single T3D PE for 1 million "divisions" in a loop with various options:
                                   varying        constant
                                   denominators   denominator
                                   ------------   -----------
    f90 64 bits                        .549           .501
    cft77 -o ieeedivide                .543           .493
    cft77 -o noieeedivide              .543           .108
    benchlib oneover_v function        .336       not applicable
    f90 32 bits                        .305           .277

    (times in seconds)

Notes on the table:
  - cf77/cft77 does not support 32-bit arithmetic (too bad!)
  - The f90 compiler does not support the -o noieeedivide switch (too bad!!)

The benchlib routine oneover_v (newsletters #34 (5/5/95) and #74 (2/16/96)) is a replacement for:

      do i = 1, n
         a( i ) = 1.0 / b( i )
      enddo

called as:

      call oneover_v( n, a, b )
Here is the program used for the timings (noieee has the same source as ieee but was compiled separately with cft77 -o noieeedivide):

      parameter( MAX = 1000000 )
      real qieee( MAX ), qnoieee( MAX ), qbench( MAX ), num( MAX )
      real*4 q32ieee( MAX ), rnum( MAX )
      do i = 1, MAX
         qieee( i )   = ranf()
         qnoieee( i ) = qieee( i )
         qbench( i )  = qieee( i )
         q32ieee( i ) = qieee( i )
         num( i )  = ranf()
         rnum( i ) = num( i )
      enddo
      call ieee( MAX, num, qieee, tieee )
      call noieee( MAX, num, qnoieee, tnoieee )
      error1 = 0.0
      do i = 1, MAX
         error1 = error1 + ( qieee( i ) - qnoieee( i ) ) ** 2
      enddo
      call ieee32( MAX, rnum, q32ieee, t32ieee )
      error2 = 0.0
      do i = 1, MAX
         error2 = error2 + ( qieee( i ) - q32ieee( i ) ) ** 2
      enddo
      call bench( MAX, num, qbench, tbench )
      error3 = 0.0
      do i = 1, MAX
         error3 = error3 + ( qieee( i ) - qbench( i ) ) ** 2
      enddo
      write( 6, 601 ) MAX, tieee, tnoieee, t32ieee, tbench
 601  format( i8, 4f6.3 )
      write( 6, 602 ) MAX, sqrt(error1), sqrt(error2), sqrt(error3)
 602  format( i8, 3e16.8 )
      end

c     wall clock time from the 150 MHz real-time clock
      real function second( )
      second = dble( irtc( ) ) / 150000000.0
      end

      subroutine ieee( length, num, q, time )
      real q( length ), num( length )
      time = second()
      do i = 1, length
         q( i ) = num( i ) / q( i )
      enddo
      time = second() - time
      end

c     identical source to ieee; compiled with cft77 -o noieeedivide
      subroutine noieee( length, num, q, time )
      real q( length ), num( length )
      time = second()
      do i = 1, length
         q( i ) = num( i ) / q( i )
      enddo
      time = second() - time
      end

      subroutine ieee32( length, num, q, time )
      real*4 q( length ), num( length )
      time = second()
      do i = 1, length
         q( i ) = num( i ) / q( i )
      enddo
      time = second() - time
      end

      subroutine bench( length, num, q, time )
      real q( length ), num( length )
      time = second()
      call oneover_v( length, q, q )
      do i = 1, length
         q( i ) = num( i ) * q( i )
      enddo
      time = second() - time
      end
The Linpack Benchmark on the T3D
Oftentimes, Linpack is dismissed as a 'one loop benchmark' because all of the performance numbers are determined by a single nest of DO loops. That nest is the core of Gaussian elimination and looks like:

      do 30 j = kp1, n
         t = a(l,j)
         if (l .eq. k) go to 20
         a(l,j) = a(k,j)
         a(k,j) = t
 20      continue
c        call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
         do 21 i = 1, n-k
            a(k+i,j) = t*a(k+i,k) + a(k+i,j)
 21      continue
 30   continue

where I have replaced the call to the saxpy BLAS1 routine with an equivalent DO loop, much as several inliners would do. But even with such a simple loop there are lots of possibly effective optimizations. Some of the possibilities I investigated were:
A. As-is - The source code from Oak Ridge National Labs without any modifications.

B. Libsci BLAS - The source code with the BLAS1 routines saxpy, sdot, and isamax deleted, so that they will be supplied by the optimized versions in libsci.a.

C. Libsci Linpack - The source code with the BLAS1 routines saxpy, sdot, and isamax and the Linpack routines sgefa and sgesl deleted, so that they will be supplied by the optimized versions in libsci.a. (Not an option on the T3D, as the Linpack library is not part of libsci.a.)

D. Inline1 - Use either the cf77 inliner or vi inlining to replace calls to saxpy, sdot, and isamax.

E. Inline2 - Use the cf77 inliner and fpp to replace the entire DO loop nest with a call to the BLAS2 routine SGERX. The above loop is replaced with:

CDIR@ IVDEP
      DO 30 J = 1, N - KP1 + 1
         T1U(J) = A(L,KP1+J-1)
         IF (L .NE. K) THEN
            A(L,KP1+J-1) = A(K,KP1+J-1)
            A(K,KP1+J-1) = T1U(J)
         ENDIF
         J3X = N - K
 30   CONTINUE
      CALL SGERX@(J3X,N-KP1+1,1.,A(1+K,K),1,T1U,1,A(1+K,KP1),1,LDA)

   A simply remarkable transformation!

F. Lapack - Replace the calling sequence of the Linpack library routines:

      call sgefa(a,lda,n,ipvt,info)
      call sgesl(a,lda,n,ipvt,b,0)

   with the equivalent sequence of Lapack library routines:

      call sgetrf( n, n, a, lda, ipvt, info )
      call sgetrs( "n", n, 1, a, lda, ipvt, b, n, info )

A summary of the performance of these methods is given below, where the speed is given in MFLOPS for solving two systems of linear equations: 100x100 and 1000x1000.
                                   <100x100 case>            <1000x1000 case>
                              T3D:1PE  T3D:1PE  YMP:M98  T3D:1PE  T3D:1PE  YMP:M98
                                       rdahead                    rdahead
  A. original source            12.4     14.1     25.0     10.7     11.4     95.0
  B. libsci blas (BLAS1 fcns)   13.8     20.4     47.0     25.2     28.3    118.2
  C. libsci linpack                               65.8                       97.2
  D. inline1 (BLAS1 fcns)       13.7     15.6     45.4     10.9     11.3     87.5
  E. inline2 (BLAS2 fcns)       25.8     26.9    120.3     35.4     35.5    190.8
  F. Lapack                     35.7     36.9    125.0     67.1     67.3    266.7

There are several notes to make about the results in the above table:
  - Lapack is a good deal. If you're still back in the Stone Age using Linpack routines, you are shooting yourself in the foot. The T3D Software Development Group took a courageous (but wise and correct) policy step in NOT providing the Linpack library on the T3D.
  - The T3D BLAS2 routine STRSM has a bug in it, and originally the T3D run aborted on the 1000x1000 case. But the beauty of working with a library like Lapack (or Linpack) is that you don't have to wait for the vendor to fix it. You can just go to the Oak Ridge National Labs anonymous ftp server (netlib2.cs.utk.edu), get the source, and compile it into your program as a replacement for the vendor's buggy version. Then if the vendor ever fixes the bug, you can delete the linking to your source version.

Inlining to the SGER (or SGERX) level isn't available on the T3D, because the fpp pass of cf77 exists only for the Cray Parallel Vector Processors (PVPs). But at least in this case we can use the transformational abilities of fpp to produce code for the T3D with something like:
t3dline:
        rm linpacks.j
        ( export TARGET=cray-ymp; cf77 -c -Wd-e68 -Zp -J linpacks.f )
        mv linpacks.j linpacksj.f
        ( export TARGET=cray-t3d; cf77 -c linpacksj.f second.f )
        mppldr -o inline linpacksj.o second.o
        inline > results.100.nrdw
        mppldr -D"rdahead=on" -o inline linpacksj.o second.o
        inline > results.100.rdw
The -J flag on the cf77 compiler produces the source listing that goes into the T3D version of cf77 in the above makefile rule. With CRI's use of read-ahead buffers on the T3E, it's a good idea to run with and without them to get some experience of which applications will benefit.
Multiprocessor Efforts - Memory Limitations
On the T3D we need to understand the uniprocessor results before we go on to the more challenging problem of using multiple processors. The first question is: "What can Craft Fortran do for us?" We'll have to determine how to distribute the arrays to the multiple processors, and then attack the above loop nest to share the work among the processors.

The array that matters for this application is the two-dimensional coefficient matrix that describes the system of linear equations. The Linpack benchmark overwrites this matrix with a factorization, and then that factorization is used to solve for one right-hand side. I believe the only array distribution that makes sense on MPPs for this problem is the column-cyclic distribution.
In Craft Fortran:
      parameter( lda = 128 )
      real a( lda, lda )
cdir$ shared a( :, :block(1) )

In High Performance Fortran:
      parameter( lda = 128 )
      real a( lda, lda )
!hpf$ distribute a( *, cyclic )

This distribution (in ASCII) for a 16 by 16 matrix distributed on 4 processors looks like:
                                 Column number
           1   2   3   4   5   6   7   8               13  14  15  16
  R   1   pe0 pe1 pe2 pe3 pe0 pe1 pe2 pe3  .  .  .  .  pe0 pe1 pe2 pe3
  o   2   pe0 pe1 pe2 pe3 pe0 pe1 pe2 pe3  .  .  .  .  pe0 pe1 pe2 pe3
  w   3   pe0 pe1 pe2 pe3 pe0 pe1 pe2 pe3  .  .  .  .  pe0 pe1 pe2 pe3
      4   pe0 pe1 pe2 pe3 pe0 pe1 pe2 pe3  .  .  .  .  pe0 pe1 pe2 pe3
  n   5   pe0 pe1 pe2 pe3 pe0 pe1 pe2 pe3  .  .  .  .  pe0 pe1 pe2 pe3
  u   6   pe0 pe1 pe2 pe3 pe0 pe1 pe2 pe3  .  .  .  .  pe0 pe1 pe2 pe3
  m   7   pe0 pe1 pe2 pe3 pe0 pe1 pe2 pe3  .  .  .  .  pe0 pe1 pe2 pe3
  b   8   pe0 pe1 pe2 pe3 pe0 pe1 pe2 pe3  .  .  .  .  pe0 pe1 pe2 pe3
  e   9    .   .   .   .   .   .   .   .   .  .  .  .   .   .   .   .
  r  10    .   .   .   .   .   .   .   .   .  .  .  .   .   .   .   .
     11    .   .   .   .   .   .   .   .   .  .  .  .   .   .   .   .
     12    .   .   .   .   .   .   .   .   .  .  .  .   .   .   .   .
     13   pe0 pe1 pe2 pe3 pe0 pe1 pe2 pe3  .  .  .  .  pe0 pe1 pe2 pe3
     14   pe0 pe1 pe2 pe3 pe0 pe1 pe2 pe3  .  .  .  .  pe0 pe1 pe2 pe3
     15   pe0 pe1 pe2 pe3 pe0 pe1 pe2 pe3  .  .  .  .  pe0 pe1 pe2 pe3
     16   pe0 pe1 pe2 pe3 pe0 pe1 pe2 pe3  .  .  .  .  pe0 pe1 pe2 pe3

This distribution has two features that no other distribution will have:
  - Columns of the matrix will be completely contained on one processor. This will take advantage of the column orientation of the factorization and allow us to use cache efficiently on each processor.
  - As the factorization proceeds, the submatrix being updated shrinks from the entire matrix toward the lower right corner. With the cyclic distribution of columns, the work stays load balanced, as each PE will have approximately the same number of columns to update (see the sketch below).
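To make the load balancing concrete, here is a small sketch of my own (not the benchmark source) of how a PE would pick out its own columns under the cyclic distribution in an explicit SPMD style. The names mype and npes stand for the PE number (0 to npes-1) and the number of PEs, however they are obtained:

c     with cyclic distribution, PE mype owns column j exactly
c     when mod( j-1, npes ) .eq. mype
      do 40 j = kp1, n
         if ( mod( j-1, npes ) .ne. mype ) go to 40
c        this PE updates its own column j, as in the loop nest above
         t = a( l, j )
         if ( l .ne. k ) then
            a( l, j ) = a( k, j )
            a( k, j ) = t
         endif
         do 41 i = 1, n-k
            a( k+i, j ) = t * a( k+i, k ) + a( k+i, j )
 41      continue
 40   continue

Every group of npes consecutive columns contains exactly one column from each PE, so at step k each PE holds roughly (n-k)/npes of the columns still being updated, no matter how far the factorization has progressed.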
One more constraint has to be handled. For this distribution, Craft works only on arrays whose dimensions are powers of two. So for this problem we have:
  - the 100x100 problem is solved on a 128x128 array
  - the 1000x1000 problem is solved on a 1024x1024 array
  - the 2000x2000 problem is solved on a 2048x2048 array
  - the 4000x4000 problem is solved on a 4096x4096 array

Using the CDIR$ MASTER/CDIR$ ENDMASTER directives, we can use this distribution of arrays on any number of processors but keep all of the computation on a single processor (a sketch follows). Of course, when PE0 needs an element located on another processor, that access will not be as fast as accessing a local element. But we still get the right answers, which is always a good start.
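Here is a minimal sketch of the idea in Craft Fortran (my own reconstruction, not the actual benchmark source), assuming the shared declaration shown earlier:

      parameter( lda = 1024 )
      integer ipvt( lda )
      real a( lda, lda ), b( lda )
cdir$ shared a( :, :block(1) )
c     ... fill the shared matrix a and right-hand side b here ...
      n = 1000
cdir$ master
c     only one PE executes this region; touching a column that
c     lives on another PE turns into a remote memory access
      call sgefa( a, lda, n, ipvt, info )
      call sgesl( a, lda, n, ipvt, b, 0 )
cdir$ endmaster
      end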
MFLOPS for a single PE computing the Linpack solution on a problem distributed among many PEs:

  # of PEs                      <problem size>
  that contain     100x100   1000x1000   2000x2000   4000x4000
  the problem
  ------------     -------   ---------   ---------   ---------
        1            9.58       2.76      too big     too big
        2            1.42       2.22       1.43       too big
        4            1.46       1.47       1.47       too big
        8            1.45       1.46       1.48        1.47
       16            1.44       1.45       1.45        1.45
       32            1.41       1.43       1.43        1.43

The table shows decreasing performance as more and more of the problem is distributed off of PE0. And now we have the dubious distinction of having solved the largest Linpack problem on a single Alpha processor, at abysmal speed. Now that we understand the uniprocessor performance and the memory limitations, we can concentrate on getting multiprocessor speedup using Craft and HPF. Stay tuned for the next newsletter.
Current Editors:

  Ed Kornkven     ARSC HPC Specialist             ph: 907-450-8669
  Kate Hedstrom   ARSC Oceanographic Specialist   ph: 907-450-8678

  Arctic Region Supercomputing Center
  University of Alaska Fairbanks
  PO Box 756020
  Fairbanks AK 99775-6020

Email Subscriptions:

  Subscribe to (or unsubscribe from) the email edition of the ARSC HPC
  Users' Newsletter.

Back issues of the ASCII email edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.