ARSC T3D Users' Newsletter 48, August 18, 1995
A Comparison of the SRC ac Compiler and the CRI cc Compiler
When the AC compiler was announced at the Spring CUG meeting, there was a lot of interest because it was supposedly faster than the Standard C compiler provided by CRI. I believe CRI's contention was that the speed differences vary from program to program. To investigate this claim, I tried some standard benchmarks and some of my own benchmarks to put some numbers into the argument. The two tables below show the effect of the standard optimization switches for each compiler. Each table has the same format. (When results are given in MFLOPS, Dhrystones or Whetstones, bigger is better; when the results are in seconds, small is beautiful.)
Table 1
Performance results for the CRI Standard C compiler (PE 1.2.2):

compiler switch:          NONE      -O     -O0     -O1     -O2     -O3
Livermore loops
  long loops              9.98   10.03    1.84    8.59   10.01   11.41  MFLOPS
  medium loops            9.72    9.73    1.84    8.38    9.73   10.27  MFLOPS
  short loops             9.73    9.74    1.84    8.40    9.37   10.01  MFLOPS
Linpack
  100   single            14.6    14.6     2.7    14.6    16.4    16.3  MFLOPS
        double            12.6    12.6     2.6    12.4    13.8    13.8  MFLOPS
  1000  single            15.3    15.3     2.8    15.3    15.8    15.8  MFLOPS
        double            10.8    10.8     2.6    10.8    11.0    11.0  MFLOPS
Dhrystones
  with                   37037   37037   28571   45454   47619   55556  Dhrystones
  without                37037   38461   28571   45454   45454   55556  Dhrystones
Whetstones               36405   36404   36405   36405   36406   36397  Whetstones
Puzzle                    45.8    45.8   135.0    40.8    45.8    48.2  seconds
Nest
  calls                  112.8   112.8   175.9     1.3     0.0     0.0  seconds
  inline                   0.2     0.2    72.5     1.7     0.2     0.2  seconds

Table 2
Performance results for the IDA SRC Gnu C compiler (AC 2.6.2):

compiler switch:          NONE      -O     -O0     -O1     -O2     -O3     -O4     -O5
Livermore loops
  long loops              2.37    8.26    2.42    8.24    9.52    9.51    9.51    9.51  MFLOPS
  medium loops            2.33    7.47    2.33    7.47    8.36    8.35    8.34    8.34  MFLOPS
  short loops             2.33    7.33    2.33    7.34    8.14    8.18    8.17    8.15  MFLOPS
Linpack
  100   single             3.7    15.2     3.7    15.2    17.0    17.0    17.0    17.0  MFLOPS
        double             3.6    13.0     3.7    13.0    14.3    14.3    14.3    14.3  MFLOPS
  1000  single             3.9    15.4     3.9    15.4    16.8    16.8    16.7    16.7  MFLOPS
        double             3.5    11.0     3.5    11.0    11.6    11.6    11.6    11.6  MFLOPS
Dhrystones
  with                   45454   66667   45454   66667   90909   90909   90909   90909  Dhrystones
  without                43478   66667   43478   66667   90909  100000   90909   90909  Dhrystones
Whetstones               24160   37896   24160   37896   38612   38613   38611   38611  Whetstones
Puzzle                   113.8    36.5   113.8    36.5    33.6    35.2    33.6    33.6  seconds
Nest
  calls                  223.5    75.1   223.5    75.1   120.4   120.4   120.3   120.4  seconds
  inline                 125.2    18.1   125.2    18.1    17.9    17.9    17.9    17.9  seconds

Both compilers support other performance switches (the gnu C compiler presents a "switch Heaven" for those inclined), but I did not test them. Similarly the sources and timers I used were exactly the same, but modification to the codes could have dramatically changed the results.
The source for each C benchmark is available from the netlib ftp site at netlib2.cs.utk.edu or from me. Below is a short description of each benchmark and its results:
Livermore Loops - this is the standard Fortran benchmark converted to C. It times 24 loops, which consist of mostly floating point operations. What is shown above is the harmonic mean for all 24 loops when run with loop lengths short (average 18), medium (average 89) and long (average 468). On both compilers, there is a slight increase in the MFLOPS rate with increasing loop length.
Linpack - this code is in C and is not comparable to the published Fortran results. This benchmark is really just a timing of a single saxpy loop, but that loop is still the most common loop of linear algebra. Both compilers did best on the version that unrolled that saxpy loop. With optimization on, the tables above show single precision running roughly 15% to 20% faster than double precision in the 100 case and roughly 40% to 45% faster in the 1000 case.
Dhrystones - a character and integer benchmark. It has been around a long time, and some compilers have accumulated "dhrystone tricks" over the years.
Whetstones - times mostly elementary functions from libm, which is the same /mpp/lib/libm.a for both compilers.
Puzzle - this is my own benchmark that tests integer arithmetic, array references and deeply nested loops.
Nest - this benchmark computes N! with deeply nested loops and a counter in the innermost loop. If the compiler can inline automatically and simplify the loop nest, then there are real performance differences.
On the Gnu compiler we have that:
  no compiler optimization switches = -O0
  -O = -O1
I couldn't detect similar simple rules for the CRI compiler.
- It is not always the case, for either compiler, that performance improves monotonically with the optimization level z in -Oz. (This, to me, is always a surprise; I guess I should stop being surprised.)
- At the highest level of optimization for both compilers, the AC compiler is not always faster than the CRI compiler. In particular, the CRI compiler is faster on the Livermore Loops, but slower on Linpack.
The 1.2.2 Release of the Programming Environment
As of the next downtime (6:00 PM, August 22, 1995) ARSC will be running the 1.2.2 Programming Release as the default. If you have any problems with this release please contact Mike Ess.
Announcement on the email@example.com
The following announcement appeared this week on the firstname.lastname@example.org reflector:
> Announcement for anyone interested in a T3D tool for partitioning
> unstructured problems.
>
> I have developed a program called pmrsb (Parallel Multilevel Recursive
> Spectral Bisection) that partitions graphs and finite-element meshes
> in parallel on the T3D. It determines processor assignments for
> vertices of a graph or elements of a mesh that simultaneously balance
> load and minimize interprocessor communication. In addition to
> partitioning a graph, pmrsb can generate a dual graph from a
> finite-element mesh. It should be able to handle very large problems
> (> 10**6 elements).
>
> The pmrsb code is not an officially supported Cray Research product
> but I can make it available to interested customers. Please let me
> know if you would like to try it.
>
> Steve Barnard
> email@example.com
The T3D Reflector
There is a T3D news reflector that you can subscribe to by sending e-mail to firstname.lastname@example.org with a short note saying you would like to be on the list of recipients. Bob Stock and Rich Raymond of the Pittsburgh Supercomputing Center and Fred Johnson of NIST are responsible for setting it up. The above announcement was circulated through this reflector.
List of Differences Between T3D and Y-MP
The current list of differences between the T3D and the Y-MP is:
- Data type sizes are not the same (Newsletter #5)
- Uninitialized variables are different (Newsletter #6)
- The effect of the -a static compiler switch (Newsletter #7)
- There is no GETENV on the T3D (Newsletter #8)
- Missing routine SMACH on T3D (Newsletter #9)
- Different Arithmetics (Newsletter #9)
- Different clock granularities for gettimeofday (Newsletter #11)
- Restrictions on record length for direct I/O files (Newsletter #19)
- Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
- Missing Linpack and Eispack routines in libsci (Newsletter #25)
- F90 manual for Y-MP, no manual for T3D (Newsletter #31)
- RANF() and its manpage differ between machines (Newsletter #37)
- CRAY2IEG is available only on the Y-MP (Newsletter #40)
- Missing sort routines on the T3D (Newsletter #41)
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.