ARSC T3D Users' Newsletter 102, August 30, 1996
f90 (It's not just the law...)
In case you hadn't heard...
 CF77 will not exist on the T3E.
 CF77 is being phased out on all CRI PVP systems. With the release of the programming environment 2.0, CF90 replaces the CF77 compiling system. (ARSC plans to upgrade to PE2.0 later this fall.)
Your Fortran 77 codes should compile under CF90, but you might want to make the switch earlier rather than later.
Numerical Recipes in Fortran 90 Available
[ Taken from a posting to comp.parallel ]
A message from the authors of Numerical Recipes:
Our new book, "Numerical Recipes in Fortran 90: The Art of Parallel Scientific Computing" and our new "Numerical Recipes Code CDROM" are both out and available now from Cambridge University Press. We've put a lot of effort into these, and we hope you like them! Here are brief descriptions:
"Numerical Recipes in Fortran 90: The Art of Parallel Scientific Computing", Volume 2 of Fortran Numerical Recipes. This new volume, intended for use with the existing book (now renamed Numerical Recipes in Fortran 77), reworks all the Numerical Recipes routines to use Fortran 90's concise parallel language constructions. Even on single processor machines, you get the benefit of a slick, modern version of Fortran, and new conciseness and clarity in the code. There are also three new chapters on Fortran 90 language features and parallel programming methods, and an introduction by Michael Metcalf.
More information on the book is available at: http://nr.harvard.edu/nr/nrf90_blurb.html
Simple Vector Operations
[ One of our T3D users, Dr. Alan Wallcraft of Stennis Space Center, contributes this article. ]
Current "workstation" Fortran compilers seem to do a relatively poor job with simple vector operations on REAL*4. This is a problem because Cray vector codes, Fortran 90 codes, and High Performance Fortran (or CM Fortran) codes contain many such operations. Using optimized BLAS can help, but some vendors don't optimize level-1 BLAS and Cray has minimal support for 32-bit BLAS on the T3D.
There is a need for a standard set of vector subroutines, like BLAS but not just for linear algebra, that could be optimized for each machine. To illustrate the problem, consider A = S and A = B where A and B are vectors and S a scalar. These operations are quite common, although they can often be avoided by code restructuring. I expected compilers to produce almost optimal code for such operations. However, using REAL*8 assignment instead of REAL*4 is 1.5 to 2 times faster on many machines and this can be achieved using "almost standard" f77. A = S is the simplest example. Here is the test program. The LOC function is nonstandard, but almost always available, and may be either INTEGER*4 or INTEGER*8 depending on the machine. It is used to detect how A is aligned w.r.t. REAL*8 word boundaries.
      PROGRAM WSETST
      IMPLICIT NONE
C
      INTEGER    NP,NN
      PARAMETER (NP=22, NN=2**NP)
C
      INTEGER IP,L,N
      REAL*4  A(NN+8)
      REAL*8  SECOND
      REAL*8  T0,T1,T2
C
      REAL*4     ZERO4
      PARAMETER (ZERO4=0.0)
C
C     PROGRAM TIMING A(1:N) = 0.0, WITH A IN CACHE (IF IT FITS).
C
C     R4WSET  - SUBROUTINE USING REAL*4 ASSIGNMENT
C     R4WSET8 - SUBROUTINE USING REAL*8 ASSIGNMENT
C
      DO IP= 1,NP
        N = 2**IP
C
        CALL R4WSET(A,ZERO4,N+8)
C
        T0 = SECOND()
C
        DO L= 1,NN,N
          CALL R4WSET(A(1),ZERO4,N)
          CALL R4WSET(A(2),ZERO4,N)
          CALL R4WSET(A(3),ZERO4,N)
          CALL R4WSET(A(4),ZERO4,N)
          CALL R4WSET(A(5),ZERO4,N)
          CALL R4WSET(A(6),ZERO4,N)
          CALL R4WSET(A(7),ZERO4,N)
          CALL R4WSET(A(8),ZERO4,N)
        ENDDO
C
        T1 = SECOND()
C
        DO L= 1,NN,N
          CALL R4WSET8(A(1),ZERO4,N)
          CALL R4WSET8(A(2),ZERO4,N)
          CALL R4WSET8(A(3),ZERO4,N)
          CALL R4WSET8(A(4),ZERO4,N)
          CALL R4WSET8(A(5),ZERO4,N)
          CALL R4WSET8(A(6),ZERO4,N)
          CALL R4WSET8(A(7),ZERO4,N)
          CALL R4WSET8(A(8),ZERO4,N)
        ENDDO
C
        T2 = SECOND()
        WRITE(6,6000) N,NN*32.E-6/(T1-T0),
     +                  NN*32.E-6/(T2-T1),(T1-T0)/(T2-T1)
      ENDDO
C
 6000 FORMAT(2X,'N = ',I8,
     +       3X,'R4WSET,R4WSET8 =',F8.2,',',F8.2,' MB/s',
     +       3X,'SPEEDUP =',F5.2)
      END
      SUBROUTINE R4WSET(S,W,N)
      IMPLICIT NONE
      INTEGER N
      REAL*4  S(N),W
C
C     S = W.
C
      INTEGER I
C
      DO I= 1,N
        S(I) = W
      ENDDO
      RETURN
      END
      SUBROUTINE R4WSET8(S,W,N)
      IMPLICIT NONE
      INTEGER N
      REAL*4  S(N),W
C
C     S = W.
C
C     LOC IS MACHINE DEPENDENT, ASSUMED TO RETURN ADDRESS IN BYTES.
C
      INTEGER*4  LOC,IS1,I8
*     INTEGER*8  LOC,IS1,I8
      PARAMETER (I8=8)
      REAL*8     W8(1)
      REAL*4     W4(2)
      EQUIVALENCE (W8,W4)
C
      W4(1) = W
      W4(2) = W
      IS1 = LOC(S(1))
      IF (MOD(IS1,I8).EQ.0) THEN
        CALL R8WSET(S(1),W8,N/2)
        S(N) = W
      ELSE
        S(1) = W
        CALL R8WSET(S(2),W8,(N-1)/2)
        S(N) = W
      ENDIF
      RETURN
      END
      SUBROUTINE R8WSET(S,W,N)
      IMPLICIT NONE
      INTEGER N
      REAL*8  S(N),W
C
C     S = W.
C
      INTEGER I
C
      DO I= 1,N
        S(I) = W
      ENDDO
      RETURN
      END
      REAL*8 FUNCTION SECOND()
      IMPLICIT NONE
C
C     EMULATION OF CDC'S SECOND TIMING ROUTINE.
C
*
*     UNIX VERSION
*
      REAL*4 TARRAY(2)
      REAL*4 ETIME
      SECOND = ETIME(TARRAY)
*
*     T3D VERSION
*
*     INTEGER IRTC
*     SECOND = IRTC() * 6.6E-9
      RETURN
      END
On each machine, this is compiled using high optimization (including automatic loop unrolling, but excluding subroutine inlining).
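The MB/s figures the program prints can be checked by hand. As a sketch of the arithmetic (the symbols B and t are introduced here for illustration and do not appear in the program): each timed loop runs NN/N iterations, each iteration makes 8 calls, and each call stores N REAL*4 values of 4 bytes. With t the elapsed time of one timed loop in seconds:

```latex
\begin{align*}
B &= \frac{\mathit{NN}}{N} \times 8 \times N \times 4
   = 32\,\mathit{NN}\ \text{bytes}, \\
\text{rate} &= \frac{B \times 10^{-6}}{t}
   = \frac{32 \times 10^{-6}\,\mathit{NN}}{t}\ \text{MB/s}, \\
\mathit{NN} = 2^{22}: \quad
B &= 2^{27} = 134{,}217{,}728\ \text{bytes} \approx 134.2\ \text{MB}.
\end{align*}
```

The number of bytes stored per timed loop is therefore the same for every N. Only the number of subroutine calls and the cache footprint change, and the printed SPEEDUP is simply the ratio of the two elapsed times.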
Cray T3D results:
  N =        2   R4WSET,R4WSET8 =   16.96,   10.49 MB/s   SPEEDUP = 0.62
  N =        4   R4WSET,R4WSET8 =   32.93,   19.88 MB/s   SPEEDUP = 0.60
  N =        8   R4WSET,R4WSET8 =   61.86,   36.51 MB/s   SPEEDUP = 0.59
  N =       16   R4WSET,R4WSET8 =  110.06,   68.32 MB/s   SPEEDUP = 0.62
  N =       32   R4WSET,R4WSET8 =  174.05,  119.27 MB/s   SPEEDUP = 0.69
  N =       64   R4WSET,R4WSET8 =  231.35,  195.69 MB/s   SPEEDUP = 0.85
  N =      128   R4WSET,R4WSET8 =  277.62,  282.63 MB/s   SPEEDUP = 1.02
  N =      256   R4WSET,R4WSET8 =  306.82,  367.51 MB/s   SPEEDUP = 1.20
  N =      512   R4WSET,R4WSET8 =  324.51,  430.44 MB/s   SPEEDUP = 1.33
  N =     1024   R4WSET,R4WSET8 =  334.36,  470.84 MB/s   SPEEDUP = 1.41
  N =     2048   R4WSET,R4WSET8 =  339.38,  492.19 MB/s   SPEEDUP = 1.45
  N =     4096   R4WSET,R4WSET8 =  341.82,  505.49 MB/s   SPEEDUP = 1.48
  N =     8192   R4WSET,R4WSET8 =  343.28,  511.05 MB/s   SPEEDUP = 1.49
  N =    16384   R4WSET,R4WSET8 =  343.96,  514.65 MB/s   SPEEDUP = 1.50
  N =    32768   R4WSET,R4WSET8 =  344.16,  516.32 MB/s   SPEEDUP = 1.50
  N =    65536   R4WSET,R4WSET8 =  344.47,  516.78 MB/s   SPEEDUP = 1.50
  N =   131072   R4WSET,R4WSET8 =  344.56,  517.58 MB/s   SPEEDUP = 1.50
  N =   262144   R4WSET,R4WSET8 =  344.52,  517.85 MB/s   SPEEDUP = 1.50
  N =   524288   R4WSET,R4WSET8 =  344.78,  518.04 MB/s   SPEEDUP = 1.50
  N =  1048576   R4WSET,R4WSET8 =  344.65,  518.17 MB/s   SPEEDUP = 1.50
  N =  2097152   R4WSET,R4WSET8 =  344.74,  518.35 MB/s   SPEEDUP = 1.50
  N =  4194304   R4WSET,R4WSET8 =  344.86,  518.58 MB/s   SPEEDUP = 1.50
SGI Power Challenge results:
  N =        2   R4WSET,R4WSET8 =   22.16,   12.66 MB/s   SPEEDUP = 0.57
  N =        4   R4WSET,R4WSET8 =   54.40,   22.90 MB/s   SPEEDUP = 0.42
  N =        8   R4WSET,R4WSET8 =   99.73,   39.39 MB/s   SPEEDUP = 0.40
  N =       16   R4WSET,R4WSET8 =  140.79,   73.92 MB/s   SPEEDUP = 0.53
  N =       32   R4WSET,R4WSET8 =  227.93,  133.41 MB/s   SPEEDUP = 0.59
  N =       64   R4WSET,R4WSET8 =  330.14,  240.05 MB/s   SPEEDUP = 0.73
  N =      128   R4WSET,R4WSET8 =  425.43,  399.90 MB/s   SPEEDUP = 0.94
  N =      256   R4WSET,R4WSET8 =  497.27,  599.47 MB/s   SPEEDUP = 1.21
  N =      512   R4WSET,R4WSET8 =  543.04,  798.79 MB/s   SPEEDUP = 1.47
  N =     1024   R4WSET,R4WSET8 =  569.36,  957.71 MB/s   SPEEDUP = 1.68
  N =     2048   R4WSET,R4WSET8 =  583.59, 1063.42 MB/s   SPEEDUP = 1.82
  N =     4096   R4WSET,R4WSET8 =  590.88, 1126.19 MB/s   SPEEDUP = 1.91
  N =     8192   R4WSET,R4WSET8 =  594.41, 1160.20 MB/s   SPEEDUP = 1.95
  N =    16384   R4WSET,R4WSET8 =  596.47, 1177.80 MB/s   SPEEDUP = 1.97
  N =    32768   R4WSET,R4WSET8 =  597.23, 1187.54 MB/s   SPEEDUP = 1.99
  N =    65536   R4WSET,R4WSET8 =  597.57, 1191.56 MB/s   SPEEDUP = 1.99
  N =   131072   R4WSET,R4WSET8 =  597.94, 1193.86 MB/s   SPEEDUP = 2.00
  N =   262144   R4WSET,R4WSET8 =  597.76, 1194.96 MB/s   SPEEDUP = 2.00
  N =   524288   R4WSET,R4WSET8 =  590.84, 1196.06 MB/s   SPEEDUP = 2.02
  N =  1048576   R4WSET,R4WSET8 =  492.22, 1063.97 MB/s   SPEEDUP = 2.16
  N =  2097152   R4WSET,R4WSET8 =  132.57,  147.14 MB/s   SPEEDUP = 1.11
  N =  4194304   R4WSET,R4WSET8 =  115.25,  126.86 MB/s   SPEEDUP = 1.10
DEC alpha (DEC2100_A500) results:
  N =        2   R4WSET,R4WSET8 =   59.82,   42.88 MB/s   SPEEDUP = 0.72
  N =        4   R4WSET,R4WSET8 =  154.17,   76.87 MB/s   SPEEDUP = 0.50
  N =        8   R4WSET,R4WSET8 =  202.23,  129.86 MB/s   SPEEDUP = 0.64
  N =       16   R4WSET,R4WSET8 =  309.73,  218.63 MB/s   SPEEDUP = 0.71
  N =       32   R4WSET,R4WSET8 =  392.91,  308.34 MB/s   SPEEDUP = 0.78
  N =       64   R4WSET,R4WSET8 =  399.76,  367.70 MB/s   SPEEDUP = 0.92
  N =      128   R4WSET,R4WSET8 =  404.46,  408.07 MB/s   SPEEDUP = 1.01
  N =      256   R4WSET,R4WSET8 =  408.07,  431.09 MB/s   SPEEDUP = 1.06
  N =      512   R4WSET,R4WSET8 =  334.59,  445.04 MB/s   SPEEDUP = 1.33
  N =     1024   R4WSET,R4WSET8 =  316.13,  439.35 MB/s   SPEEDUP = 1.39
  N =     2048   R4WSET,R4WSET8 =  265.48,  369.67 MB/s   SPEEDUP = 1.39
  N =     4096   R4WSET,R4WSET8 =  214.54,  258.49 MB/s   SPEEDUP = 1.20
  N =     8192   R4WSET,R4WSET8 =  162.94,  195.06 MB/s   SPEEDUP = 1.20
  N =    16384   R4WSET,R4WSET8 =  137.66,  159.90 MB/s   SPEEDUP = 1.16
  N =    32768   R4WSET,R4WSET8 =  127.21,  140.47 MB/s   SPEEDUP = 1.10
  N =    65536   R4WSET,R4WSET8 =  118.96,  128.16 MB/s   SPEEDUP = 1.08
  N =   131072   R4WSET,R4WSET8 =  112.17,  115.56 MB/s   SPEEDUP = 1.03
  N =   262144   R4WSET,R4WSET8 =  104.98,  107.69 MB/s   SPEEDUP = 1.03
  N =   524288   R4WSET,R4WSET8 =  101.64,  101.27 MB/s   SPEEDUP = 1.00
  N =  1048576   R4WSET,R4WSET8 =   98.16,   98.79 MB/s   SPEEDUP = 1.01
  N =  2097152   R4WSET,R4WSET8 =   96.57,   96.98 MB/s   SPEEDUP = 1.00
  N =  4194304   R4WSET,R4WSET8 =   95.43,   95.70 MB/s   SPEEDUP = 1.00
Sun SPARC 20/61 results:
  N =        2   R4WSET,R4WSET8 =   25.78,    9.07 MB/s   SPEEDUP = 0.35
  N =        4   R4WSET,R4WSET8 =   39.02,   12.38 MB/s   SPEEDUP = 0.32
  N =        8   R4WSET,R4WSET8 =   74.87,   17.37 MB/s   SPEEDUP = 0.23
  N =       16   R4WSET,R4WSET8 =  104.07,   23.05 MB/s   SPEEDUP = 0.22
  N =       32   R4WSET,R4WSET8 =  130.78,   30.57 MB/s   SPEEDUP = 0.23
  N =       64   R4WSET,R4WSET8 =  149.31,   53.29 MB/s   SPEEDUP = 0.36
  N =      128   R4WSET,R4WSET8 =  160.56,   89.04 MB/s   SPEEDUP = 0.55
  N =      256   R4WSET,R4WSET8 =  166.90,  134.45 MB/s   SPEEDUP = 0.81
  N =      512   R4WSET,R4WSET8 =  170.30,  179.61 MB/s   SPEEDUP = 1.05
  N =     1024   R4WSET,R4WSET8 =  172.14,  217.24 MB/s   SPEEDUP = 1.26
  N =     2048   R4WSET,R4WSET8 =  172.94,  242.02 MB/s   SPEEDUP = 1.40
  N =     4096   R4WSET,R4WSET8 =  173.50,  255.16 MB/s   SPEEDUP = 1.47
  N =     8192   R4WSET,R4WSET8 =  173.52,  264.75 MB/s   SPEEDUP = 1.53
  N =    16384   R4WSET,R4WSET8 =  173.47,  268.55 MB/s   SPEEDUP = 1.55
  N =    32768   R4WSET,R4WSET8 =  171.82,  267.37 MB/s   SPEEDUP = 1.56
  N =    65536   R4WSET,R4WSET8 =  168.56,  264.91 MB/s   SPEEDUP = 1.57
  N =   131072   R4WSET,R4WSET8 =  162.87,  256.73 MB/s   SPEEDUP = 1.58
  N =   262144   R4WSET,R4WSET8 =  161.47,  247.10 MB/s   SPEEDUP = 1.53
  N =   524288   R4WSET,R4WSET8 =   41.48,   44.59 MB/s   SPEEDUP = 1.07
  N =  1048576   R4WSET,R4WSET8 =   41.47,   44.76 MB/s   SPEEDUP = 1.08
  N =  2097152   R4WSET,R4WSET8 =   41.60,   44.61 MB/s   SPEEDUP = 1.07
  N =  4194304   R4WSET,R4WSET8 =   41.49,   44.62 MB/s   SPEEDUP = 1.08
Sun UltraSPARC 1/140 results:
  N =        2   R4WSET,R4WSET8 =   49.78,   25.36 MB/s   SPEEDUP = 0.51
  N =        4   R4WSET,R4WSET8 =  129.54,   46.50 MB/s   SPEEDUP = 0.36
  N =        8   R4WSET,R4WSET8 =  117.02,   89.79 MB/s   SPEEDUP = 0.77
  N =       16   R4WSET,R4WSET8 =  197.45,  130.32 MB/s   SPEEDUP = 0.66
  N =       32   R4WSET,R4WSET8 =  269.68,  234.06 MB/s   SPEEDUP = 0.87
  N =       64   R4WSET,R4WSET8 =  362.30,  350.17 MB/s   SPEEDUP = 0.97
  N =      128   R4WSET,R4WSET8 =  424.71,  472.96 MB/s   SPEEDUP = 1.11
  N =      256   R4WSET,R4WSET8 =  460.14,  589.28 MB/s   SPEEDUP = 1.28
  N =      512   R4WSET,R4WSET8 =  482.29,  656.96 MB/s   SPEEDUP = 1.36
  N =     1024   R4WSET,R4WSET8 =  495.44,  709.14 MB/s   SPEEDUP = 1.43
  N =     2048   R4WSET,R4WSET8 =  501.21,  735.97 MB/s   SPEEDUP = 1.47
  N =     4096   R4WSET,R4WSET8 =  504.29,  747.54 MB/s   SPEEDUP = 1.48
  N =     8192   R4WSET,R4WSET8 =  505.50,  753.96 MB/s   SPEEDUP = 1.49
  N =    16384   R4WSET,R4WSET8 =  506.07,  757.10 MB/s   SPEEDUP = 1.50
  N =    32768   R4WSET,R4WSET8 =  505.23,  754.07 MB/s   SPEEDUP = 1.49
  N =    65536   R4WSET,R4WSET8 =  499.08,  739.38 MB/s   SPEEDUP = 1.48
  N =   131072   R4WSET,R4WSET8 =  381.13,  506.37 MB/s   SPEEDUP = 1.33
  N =   262144   R4WSET,R4WSET8 =  153.88,  171.16 MB/s   SPEEDUP = 1.11
  N =   524288   R4WSET,R4WSET8 =  149.14,  165.40 MB/s   SPEEDUP = 1.11
  N =  1048576   R4WSET,R4WSET8 =  149.13,  165.53 MB/s   SPEEDUP = 1.11
  N =  2097152   R4WSET,R4WSET8 =  149.15,  165.47 MB/s   SPEEDUP = 1.11
  N =  4194304   R4WSET,R4WSET8 =  149.31,  165.78 MB/s   SPEEDUP = 1.11
In all cases, the REAL*8 version is faster for O(1000) vector lengths, but it is not necessarily faster once the secondary cache size is exceeded. Presumably, hand-coded assembly language could do even better.
The A = B case is similar, but only if A and B are appropriately aligned with each other. The BLAS routine SCOPY (HCOPY on T3D) should be the fastest way to do A = B, if it has been optimized for a given machine.
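To make "appropriately aligned with each other" concrete, here is a minimal sketch in the same f77 style. It is not from the article: the names CPYALN and NWORD8, the case codes, and the sample byte addresses are all hypothetical. It assumes both vectors are at least REAL*4 aligned and that a REAL*8 copy would be structured like R4WSET8 above, so A and B must sit at the same offset from a REAL*8 word boundary. It only classifies the alignment and does no copying.

```fortran
      PROGRAM CPYTST
      IMPLICIT NONE
C
C     HYPOTHETICAL SKETCH, NOT FROM THE ARTICLE.  THE BYTE ADDRESSES
C     BELOW ARE MADE UP; IN PRACTICE THEY WOULD COME FROM LOC(A(1))
C     AND LOC(B(1)), AND MIGHT NEED INTEGER*8 ON SOME MACHINES.
C
      INTEGER    N
      PARAMETER (N=9)
      INTEGER CPYALN,NWORD8
      INTEGER K,ICASE
      INTEGER IA(3),IB(3)
      DATA IA /4096,4100,4096/
      DATA IB /8192,8196,8196/
C
      DO K= 1,3
        ICASE = CPYALN(IA(K),IB(K))
        WRITE(6,6000) IA(K),IB(K),ICASE,NWORD8(ICASE,N)
      ENDDO
C
 6000 FORMAT(1X,'LOC(A),LOC(B) =',2I6,3X,'CASE =',I3,
     +       3X,'REAL*8 WORDS =',I3)
      END
      INTEGER FUNCTION CPYALN(IA1,IB1)
      IMPLICIT NONE
      INTEGER IA1,IB1
C
C     IA1,IB1 ARE THE BYTE ADDRESSES OF A(1) AND B(1), ASSUMED TO
C     BE MULTIPLES OF 4.
C       0  BOTH ON A REAL*8 BOUNDARY, COPY PAIRS FROM ELEMENT 1.
C       1  BOTH 4 BYTES PAST A BOUNDARY, COPY ELEMENT 1 AS REAL*4
C          AND THEN COPY PAIRS FROM ELEMENT 2.
C      -1  MISALIGNED W.R.T. EACH OTHER, USE A PLAIN REAL*4 COPY.
C
      INTEGER MA,MB
C
      MA = MOD(IA1,8)
      MB = MOD(IB1,8)
      IF     (MA.EQ.0 .AND. MB.EQ.0) THEN
        CPYALN = 0
      ELSEIF (MA.NE.0 .AND. MB.NE.0) THEN
        CPYALN = 1
      ELSE
        CPYALN = -1
      ENDIF
      RETURN
      END
      INTEGER FUNCTION NWORD8(ICASE,N)
      IMPLICIT NONE
      INTEGER ICASE,N
C
C     NUMBER OF REAL*8 ASSIGNMENTS AVAILABLE WHEN COPYING N REAL*4
C     ELEMENTS, FOR THE CASE RETURNED BY CPYALN.
C
      IF     (ICASE.EQ.0) THEN
        NWORD8 = N/2
      ELSEIF (ICASE.EQ.1) THEN
        NWORD8 = (N-1)/2
      ELSE
        NWORD8 = 0
      ENDIF
      RETURN
      END
```

With the sample addresses, the first pair is case 0 and the second is case 1, each allowing 4 REAL*8 assignments for N = 9. The third pair is case -1: one vector is on a REAL*8 boundary and the other is not, so no REAL*8 copy is possible and SCOPY or a plain REAL*4 loop is the fallback.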
Quick-Tip Q & A
Q: What's a handy way to "vi" every file in the current working directory which contains a given string? (E.g., you want to read every T3D Newsletter which mentions "CRAFT".)
A: {{ How can you delete a file named "-i" ??? }}
   rm ./-i    # Succinct! Sent in by a reader.
   rm -- -i   # Also sent in. The flag, "--" is common to many
              # UNICOS commands (e.g., "f90"), and says: "I am
              # the last flag."
[ Answers, questions, and tips graciously accepted. ]
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Email Subscriptions:

Subscribe to (or unsubscribe from) the email edition of the
ARSC HPC Users' Newsletter.

Back issues of the ASCII email edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.