ARSC HPC Users' Newsletter 236, January 4, 2002

Direct Numerical Simulation of Turbulent Flows

Steve de Bruyn Kops has a brief web page introducing some of his work. The images are from runs on ARSC's T3E:

http://www.ecs.umass.edu/mie/faculty/debk/Research/dns.html

Removing Redundancy - A Powerful Optimization Technique

[ Many thanks to John L. Larson for submitting this in response to our request for comments in the "Vectorization Quiz," in the last newsletter. ]

Here are abbreviated HPM outputs of the original and two optimized implementations of the Vectorization Quiz given in Newsletter 235:

Original Implementation


  Group 0:  CPU seconds   : 20.90579      CP executing     :  10452894570
  Floating adds/sec       :     9.57M     F.P. adds        : 200000514
  Floating multiplies/sec :    11.96M     F.P. multiplies  : 250000430
  Floating reciprocal/sec :     2.39M     F.P. reciprocals : 50000002
  Floating ops/CPU second :    23.92M     ****total 500,000,946 Fl Pt Ops

Second Implementation


  Group 0:  CPU seconds   :    8.02257      CP executing     : 4011283880
  Floating adds/sec       :      31.16M     F.P. adds        : 250000513
  Floating multiplies/sec :      37.39M     F.P. multiplies  : 300000430
  Floating reciprocal/sec :       6.23M     F.P. reciprocals : 50000002
  Floating ops/CPU second :      74.79M   ****total 600,000,945 Fl Pt Ops

Third Implementation


  Group 0:  CPU seconds   :    0.65617      CP executing     : 328084890
  Floating adds/sec       :     609.60M     F.P. adds        : 400000584
  Floating multiplies/sec :     838.20M     F.P. multiplies  : 550000426
  Floating reciprocal/sec :     152.40M     F.P. reciprocals : 100000001
  Floating ops/CPU second :    1600.20M   ****total 1,050,001,011 Fl Pt Ops

While the third implementation in the Newsletter runs in about 1/32 of the CPU time of the original implementation, the HPM statistics show that it also performs 2.1 times as many floating point operations as the original. So although the third implementation has an execution rate of 1600 MFlops/sec (67 times higher than the original implementation), this extra work limits the final speedup to only 32.

To understand completely what is happening here, it is important to recall the difference between performance and execution rate.

Look at the CPU time, not the MFlops rate. The MFlops rate is not a mathematically valid basis for comparing the performance of two implementations when the amount of work changes.


  Performance (implementation 1) =     1 / T1  (T1 = CPU Time of impl. 1)
  Execution Rate (implementation 1) =  W1 / T1 (W1 = work in impl. 1)
  
  Performance (implementation 2) =     1 / T2  (T2 = CPU Time of impl. 2)
  Execution Rate (implementation 2) =  W2 / T2 (W2 = work in impl. 2)
  
  Relative Performance (Speedup) =     T1 / T2
  
  Relative Execution Rates
                  (W2/T2)/(W1/T1) =    (T1/T2) * (W2/W1)    (equation 1)
                                  =    Speedup * (W2/W1)
                  (W2/W1) is called Redundancy

Comparing performance with Execution Rates is only valid when W1 = W2 (Redundancy = 1). In fact, an implementation with a higher Execution Rate can actually take more CPU time if the Redundancy is large enough.

In the above example, the relationship of performances and execution rates of the original and third implementation can be stated as


  Execution Rate (orig. impl.) = 23.92 MFlops/sec
  Execution Rate (3rd impl.) = 1600 MFlops/sec

  Relative Execution Rate = 1600 / 23.92 = 67

  Relative Performance (Speedup) = 20.9 / 0.656 = 32

  Redundancy = 1,050,001,011 / 500,000,946 = 2.1
  
                  67 = 1600 / 23.92   =    32 * 2.1     (equation 1)
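
The same bookkeeping can be done in a few lines of Fortran.  Here is a
minimal sketch (mine, not part of the original quiz code) that plugs the
HPM numbers quoted above into equation 1:


        program redundancy_check
          implicit none
          double precision :: t1, w1, t2, w2          ! CPU seconds and flop counts from the HPM
          double precision :: speedup, redundancy, rel_rate

          t1 = 20.90579d0  ;  w1 =  500000946.0d0     ! original implementation
          t2 =  0.65617d0  ;  w2 = 1050001011.0d0     ! third implementation

          speedup    = t1 / t2                        ! relative performance
          redundancy = w2 / w1                        ! relative work
          rel_rate   = (w2/t2) / (w1/t1)              ! relative execution rate

          print *, 'speedup              =', speedup              ! ~32
          print *, 'redundancy           =', redundancy           ! ~2.1
          print *, 'relative exec rate   =', rel_rate             ! ~67
          print *, 'speedup * redundancy =', speedup * redundancy ! same as rel_rate (equation 1)
        end program redundancy_check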

A close examination of the generated code would show that the third implementation evaluates the function, f, twice for the same argument. This is the source of the redundancy. However, the two evaluations of f for the same argument are independent operations, so the duplication increases the parallelism in the program, which may be exploited for vectorization or parallel processing. This is one reason why the third implementation has such a high execution rate.
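
As an illustration only (the actual third implementation appeared in
Newsletter 235; this sketch and its stand-in integrand are my own guess at
its shape), a version that evaluates f at both endpoints of every panel
might look like the following.  Note that left(i) is the same point as
right(i-1), so every interior point is evaluated twice, but the loop body
has no recurrence and vectorizes freely:


        program redundant_trapezoid
          implicit none
          integer, parameter :: n = 1000000
          real :: left(0:n-1), right(0:n-1)           ! one panel per element
          real :: a, h, integral
          integer :: i

          a = 0.0
          h = 1.0 / real(n)
          do i = 0, n-1                               ! independent iterations
             left(i)  = f(a + i*h)                    ! left(i) equals right(i-1),
             right(i) = f(a + (i+1)*h)                ! so interior points are computed twice
          end do
          integral = h * sum(left + right) / 2.0
          print *, 'integral =', integral

        contains
          real function f(x)                          ! arbitrary stand-in for the quiz integrand
            real, intent(in) :: x
            f = 1.0 / (1.0 + x*x)
          end function f
        end program redundant_trapezoid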

During optimization, always check the (work) Redundancy of your implementations with the HPM.

This example leads one to wonder if there is an optimization of the original implementation that improves performance but does not increase the redundancy along the way.

Look more closely at the original loop, and determine exactly what is being computed.


        integral = 0
        do i = 0 , n-1
                integral = integral + h*( f(a+i*h) + f(a+(i+1)*h) )/2.0
        enddo

Unrolling this loop entirely gives


        integral =        h* ( f(a+0*h) + f(a+1*h) ) /2.0
                        + h* ( f(a+1*h) + f(a+2*h) ) /2.0
                        + h* ( f(a+2*h) + f(a+3*h) ) /2.0
                        + h* ( f(a+3*h) + f(a+4*h) ) /2.0
                        etc
                        + h* ( f(a+(n-2)*h) + f(a+(n-1)*h) ) /2.0
                        + h* ( f(a+(n-1)*h) + f(a+(n)*h) ) /2.0

which is the same as


        integral =        h*(
                          f(a+0*h) * 0.5
                        + f(a+1*h)
                        + f(a+2*h)
                        etc
                        + f(a+(n-1)*h)
                        + f(a+n*h) * 0.5
                        )

which is better computed with in-lining enabled as


        integral = f(a) * 0.5
        do i = 1, n-1
                integral = integral + f(a+i*h)
        end do
        integral = h * ( integral +  (f(a+n*h) * 0.5) )

The abbreviated HPM output for this version is

Fourth Implementation


  Group 0:  CPU seconds   :    0.34496      CP executing     :      172479500
  Floating adds/sec       :     579.78M     F.P. adds        :      200000585
  Floating multiplies/sec :     724.73M     F.P. multiplies  :      250000432
  Floating reciprocal/sec :     144.94M     F.P. reciprocals :       50000002
  Floating ops/CPU second :    1449.45M   ****total 500,001,019 Fl Pt Ops

Comparing this fourth implementation to the original implementation we have


  Execution Rate (orig. impl.) = 23.92 MFlops/sec
  Execution Rate (4th impl.) = 1449 MFlops/sec
  
  Relative Execution Rate = 1449 / 23.92 = 61
  
  Relative Performance (Speedup) = 20.9 / 0.345 = 61
  
  Redundancy = 500,001,019 / 500,000,946 = 1.0

                61 = 1449 / 23.92   =    61 * 1     (equation 1)

It is interesting to compare the fourth implementation to the third implementation.


  Execution Rate (3rd impl.) = 1600 MFlops/sec
  Execution Rate (4th impl.) = 1449 MFlops/sec
  
  Relative Execution Rate = 1600 / 1449 = 1.1
  
  Relative Performance (Speedup) = 0.345 / 0.656 = 0.526
  
  Redundancy =  1,050,001,011 / 500,001,019 = 2.1
  
                  1.1 = 1600 / 1449  =     0.526  *   2.1   (equation 1)

This demonstrates that an implementation with a lower execution rate may actually solve the problem in less time, because the program with the higher execution rate is burdened with redundant work. A Relative Execution Rate greater than 1 does not imply a Relative Performance greater than 1.

Increasing the work can sometimes improve Vectorization and MFlops rate, and maybe even decrease the CPU time (especially with a high vector/scalar rate ratio), but this should not be done unless necessary. Try to optimize first without increasing (or by decreasing) the work by understanding exactly what is being computed. Appreciate and use all the information that the HPM counters are providing.

I think Seymour Cray said it precisely - "I am all for simplicity. If it's too complicated, I can't understand it!"

or

Albert Einstein - "Everything should be made as simple as possible, but not simpler."

Reference

See "A Parallelism-Based Analytic Approach to Performance Evaluation Using Application Programs," David K. Bradley and John L. Larson, invited paper, Proceedings of the IEEE, Vol. 81, No. 8, August 1993, pp. 1126-35. Available from the author.

John L. Larson j_larson@dnai.com

Taming the I/O Beast

[ Thanks to Jeff McAllister of ARSC for this article. ]

Recently I was working on a project with potentially huge I/O requirements. We were doing some rapid prototyping on the simplest case of the problem: reading an array in from a file and doing computations based on that data. This was all fine and dandy at the test case resolution but seemed like it could get pretty scary in the "real" case, with larger arrays and the need to read a few gazillion of them. Adding a few zeroes to the magnitude of a problem has a way of making previously insignificant things monstrous. In this case, bashing some numbers into a calculator showed we might be spending a year or two just waiting on disks. So I decided to do some research into how to best deal with the I/O beast given the equipment at ARSC. There can be huge differences in filesystem speeds and in how fast they can be accessed from a program -- enough to make impossible things feasible again.

I needed answers to these questions:

  1. how much faster can I/O be made with code changes?
  2. how much faster can I/O be made by changing filesystems and machines?

Though this article focuses on ARSC equipment, the principles it illustrates are generally applicable. If you're not an ARSC user you can skip the detailed descriptions of ARSC's environment; wherever you work, you'll face similar I/O performance issues.

Method

In order to compare performance in a standard manner, the numbers in this article were obtained from a script which compiles a short Fortran 90 code on all ARSC systems and times how long it takes to write a file of a given size. Bandwidth is derived as (file size / write time). Reads should be slightly faster. Times given are an average of several runs under varying conditions.

As the project I was working on was primarily in Fortran, the examples will stick to that language. A comparison of C/C++ and Fortran I/O is beyond the scope of this article (though perhaps I'll cover it in another one).
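
The benchmark script itself isn't reproduced here, but the core of such a
test is simple.  Here is a minimal Fortran 90 sketch of the idea (the file
name, array size, and 4-byte-real assumption are mine); for a fast
filesystem you would want a much larger file, but the structure is the
same:


        program write_bandwidth
          implicit none
          integer, parameter :: nwords = 262144       ! 1 MB, assuming 4-byte default reals
          real :: a(nwords)
          integer :: count0, count1, rate
          real :: seconds, mbytes

          a = 1.0
          mbytes = 4.0 * real(nwords) / (1024.0*1024.0)

          call system_clock(count0, rate)             ! start the clock
          open (unit=23, file='bwtest.dat', form='unformatted', status='replace')
          write (unit=23) a
          close (unit=23)                             ! close inside the timing so the data is flushed
          call system_clock(count1)

          seconds = real(count1 - count0) / real(rate)
          print *, mbytes, 'MB in', seconds, 'sec =', mbytes/seconds, 'MB/sec'
        end program write_bandwidth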

Optimizing I/O by changing code

After comparing results from my standardized tests, it became clear that relatively simple changes in code can result in tremendous speedup -- and that what works fastest works fastest everywhere.

Single Element vs Whole-array I/O Coding

Worst: single array elements, i.e., code that explicitly loops through an array and writes only one element at a time.


do iz=1,maxz
  do iy=1,maxy
    do ix=1,maxx
      write (unit=23) A(ix,iy,iz)
    ...

Better: outputting several elements at once (e.g., an entire row or column)


do iz=1,maxz
  do iy=1,maxy
    write (unit=23) (A(ix,iy,iz),ix=1,maxx)
  ...

Best: entire array at once. A general principle is that bigger "chunks" go faster. Fortran 90 makes this easy with array syntax.


    write (unit=23) A
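
For reference, here is a self-contained version of the three patterns above
(array dimensions, unit number, and file name are arbitrary); only the
write statements differ:


        program write_patterns
          implicit none
          integer, parameter :: maxx = 64, maxy = 64, maxz = 64
          real :: A(maxx, maxy, maxz)
          integer :: ix, iy, iz

          A = 0.0
          open (unit=23, file='a.dat', form='unformatted', status='replace')

          ! worst: one element per write statement
          do iz = 1, maxz
            do iy = 1, maxy
              do ix = 1, maxx
                write (unit=23) A(ix, iy, iz)
              end do
            end do
          end do

          ! better: one row per write statement (implied do loop)
          do iz = 1, maxz
            do iy = 1, maxy
              write (unit=23) (A(ix, iy, iz), ix = 1, maxx)
            end do
          end do

          ! best: the entire array in a single write statement
          write (unit=23) A

          close (unit=23)
        end program write_patterns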

Formatted vs Unformatted I/O

Formatted I/O is both human readable and the easiest way to make sure results are portable anywhere. Portability is a good thing because it's tedious and time-consuming to convert datasets every time they are needed on a different system, not to mention wasteful of storage space to keep several copies. It would be nice if it were easier to get around the proprietary nature of binary storage (using the IEEE standard is another topic I'll have to pick up at a later time). ASCII (formatted) output is ideal for portability -- unfortunately you have to give up precision to keep file sizes reasonable, and it's deadly for performance.
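
In code, the choice is made in the open and write statements.  Here is a
minimal sketch (file names and the format descriptor are arbitrary):


        program formatted_vs_unformatted
          implicit none
          real :: A(1000)
          A = 3.14159

          ! formatted (ASCII): portable and human readable, but bulky and slow
          open (unit=21, file='a.txt', form='formatted', status='replace')
          write (unit=21, fmt='(5e15.7)') A
          close (unit=21)

          ! unformatted (binary): compact and fast, but not portable between architectures
          open (unit=22, file='a.dat', form='unformatted', status='replace')
          write (unit=22) A
          close (unit=22)
        end program formatted_vs_unformatted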

I had always heard that formatted I/O was slower, though I didn't realize how much slower until I looked at the numbers. Here's a short table of observed bandwidths writing a 1 MB file, obtained from my trusty SGI O2's /scratch.


                     formatted       unformatted
single element I/O    .17 MB/sec       .65 MB/sec
full array I/O        .73 MB/sec     15.56 MB/sec

Using the slowest time as a base, here's the same table with speedups instead:


                     formatted       unformatted
single element I/O   1.0x (base)       3.8x
full array I/O       4.3x             91.5x

(Your results will vary -- my attempt here is to illustrate just how widely bandwidth can differ between the different methods.)

Once I'd figured out how to move data in and out from any filesystem faster, I needed to make sense of the many filesystems available at ARSC to figure out the best machine to run the code on and which filesystems to use.

ARSC I/O Architecture Overview

There are places to read/write data quickly and places to store it for a long time. Everything is a tradeoff, so results can often be obtained much faster if the data is staged somewhere fast while it is actively being used, rather than worked on directly where it is archived. For moving data between the two, system utilities like 'cp' will perform much better than any user code.

ARSC has organized things so that the short-term filesystems (i.e. /scratch, /tmp, /ssd) are optimized for performance, while the long-term filesystems add nice management features. These include DMF (Data Migration Facility), a seamless integration of tape and hard drive available on the Crays which makes the terabyte-scale space of the StorageTek silo accessible under the same directory structure. Under DMF, files larger than 16 KB can be migrated to tapes in the StorageTek silo and do not count against your quota.

All of this fits the basic workflow model


         preprocess -> work -> postprocess

where the preprocessing and postprocessing stages can include movement to and from fast temporary storage.

Here's a list of the filesystems available at ARSC. (For details on quotas, purging, and backups, see ARSC's storage policy: http://www.arsc.edu/support/howtos.storage.html .) You can see what is available on a particular system using the 'df' command. I like 'df -Pk' as it standardizes the output on different systems, and it has always been easier for me to think in kilobytes instead of the default 512-byte blocks.

  • Home directories: generally, these start with /u1 or /u2. On the HPC systems (yukon, chilkoot, icehawk) they are local drives, and on the Crays, they are managed by DMF. Home directories on ARSC SGI systems are NFS mounted.

  • Local temporary drives: /scratch on the IBM and the SGIs, /tmp on the Crays, and /vidscratch only on video2. Each IBM node has its own /scratch. These are purged and not backed up. They aren't much faster than the home directories (except on the SGIs), but their role is to provide extra space for short-term work.

  • High-performance filesystems: /tmp2 (yukon, chilkoot, icehawk) and /ssd (chilkoot only). Because the number of users on a filesystem affects its performance, these are restricted to only a few users at a time. The /tmp2 filesystems are striped, so I/O speed is multiplied over many disk heads; these striped drives only pull ahead of the others when moving vast amounts of data. Fortunately, we recently added a very fast filesystem without this constraint: /ssd arrived with the SV1ex upgrade. Memory beyond the machine's normal addressable space was configured as a filesystem. This skips all of the hardware overhead required for traditional disks, and it's physically close -- just more memory chips in the same cabinet. This makes it fast (but temporary). Not surprisingly, /ssd shows the highest repeatable bandwidth for all sizes of files.

  • "omnipresent" filesystems : /allsys/u1 and /viztmp/u1 can be "seen" by all systems. /allsys/u2 and /viztmp/u2 exist only on the HPC systems (yukon, chilkoot, and icehawk). From chilkoot they are local drives. They are NFS mounted everywhere else meaning all data must travel through a network. Except on chilkoot, these drives are not an option when performance is critical. /allsys is for long-term storage, especially between multiple users and systems. /viztmp is primarily a staging area for use while creating large visualizations.

Here's an image showing ARSC's file systems:

In the image, filesystems available on a machine are listed by that machine, with my sloppy lines to indicate network connections.

Network file systems (NFS) don't look any different from local drives at the prompt, but they can't perform anywhere near as well as a local drive because all data has to cross a network. Performance is optimized between yukon, chilkoot, video1, and video2, as they communicate across a HIPPI network. This has several benefits, one of which is that allsys and viztmp performance between these systems is much closer to that of a local drive.

Which filesystem/method should I use?

Now that I had a full list of filesystems, I wanted to know:

  1. How to best take advantage of a fast filesystem.
  2. How much speedup to expect by moving to a fast filesystem.

Using fast filesystems

No matter which machine or filesystem, formatted and single-element I/O perform poorly. Larger files don't go any faster, so you're just left waiting longer and longer as your needs scale up.


                                    time required to write a file of
I/O method          avg. MB/sec      1MB          1GB          10GB
------------------  ------------   ---------  ----------  ------------
single unformatted 
and array formatted   .9            1.1 sec    18.5 min    3.1 hours
single formatted      .2            5.0 sec    83.3 min    13.9 hours

Comparing ARSC filesystems

It turned out that the only way to make repeated tests feasible was to use whole-array unformatted I/O. Because the other options are so slow and perform the same everywhere, regardless of size or filesystem, the following numbers use whole-array unformatted I/O only. The faster filesystems, in particular, only show their advantage with larger files of this type.

                                    1 MB                    1 GB
                              bw (MB/s)  seconds     bw (MB/s)   seconds
  chilkoot  /ssd                  85.29     0.01        282.77      3.56
            /tmp2                 10.90     0.09        196.08      5.28
            /tmp                  14.99     0.07         28.32     36.77
            /allsys/u1/uaf        13.71     0.07         13.71     70.00 *
            /u1/uaf               11.99     0.09         11.99     90.00 *
  icehawk1  /tmp2                  9.24     0.39         24.51     40.83
            /u1/uaf               21.17     0.07         14.67     69.04
            /scratch              21.50     0.06          9.33    107.97
            /allsys/u1/uaf         0.20     6.26          0.20   6260.00 *
  video2    /vidscratch           76.20     0.01        121.65      8.23
            /scratch              77.69     0.02        119.36      8.40
            /u1/uaf                3.84     0.26          3.84       n/a
            /allsys/u1/uaf         2.39     0.42          2.39    420.00 *
  yukon     /tmp2                  7.63     0.14         54.89     18.27
            /tmp                  12.83     0.09         18.81     54.98
            /u1/uaf               10.40     0.10         10.40    100.00 *
            /allsys/u1/uaf         1.46     0.70          1.46    700.00 *

* = 1 GB time estimated for comparison; not a high-performance filesystem

Answering my question about how much difference a faster filesystem can make: if whole-array unformatted I/O is used, then yes, the filesystem can make a huge difference. If I had a code writing gigabytes of data to allsys from the IBM and were able to port it to chilkoot and use /ssd, I would see roughly a 1500x speedup in I/O! Writing out that 1 GB file would drop from nearly 2 hours to a few seconds, which would make it well worth my while to come up with a script to move data not currently being used back to allsys. It's local on chilkoot, and ftp or cp will always outperform any code I could write.

Conclusion

I/O can be made several orders of magnitude faster by

  1. using a faster method (i.e. reading/writing an entire array at once, unformatted)
  2. using a faster filesystem.

Generally, converting I/O methods and using faster filesystems only makes sense if I/O is a significant fraction of total runtime and it's possible to set up your algorithm so it reads and writes large "chunks" of unformatted data at a time. High-performance filesystems show higher bandwidth, but only under the right conditions. If these conditions can't be met, because of requirements or program design, then performance will be about the same slow rate just about everywhere.

Quick-Tip Q & A


A:[[ What are your "New Year's 'Computing' Resolutions" ???

   
#
# We got responses from three different readers.  Thanks! 
# And Happy New Year, everyone.
#

I resolve to finally back up all of my important data that is lying
around on various aging floppy disks, zip disks, and hard drives.  Now
that my CD-RW drive is working, I no longer have an excuse.

I resolve to learn to use autoconf/automake with f90 and to brush up on
Perl enough to rewrite my aged f90 makedepend script.

I resolve to get the Gulf of Alaska ocean model running out past a year,
looking sensible.

Here is the short list of my New Year's Computing Resolutions:

  I resolve to finish reading
    - Beginning Perl for Bioinformatics
    - Developing Bioinformatics Computer Skills
    - Using MPI
    - SP System Performance Tuning Redbook
    - Learning Maya2
    - And for relaxation: Homer's Odyssey. (Translated version.  I'm not
      educated/crazy enough yet to read it in the original! (-;)

  I resolve to improve my computing skills in
    - Fortran
    - Perl
    - OpenMP
    - MPI
    - ROMS
    - Various Biological software packages
    - Linux Porting of various scientific packages.
    - C/C++
    - AIX

  And finally in March I plan to make a new list of resolutions. Ya
  gotta keep busy!


#
# And some ideas from the editorial staff...
#

New Year Resolutions.

  - I will use a tool to inspect the performance of the code I run.
  
  - I will improve my job scripts.
  
  - I will consider my storage needs.
  
  - I will tell centres if I need anything.
  
  - I will give my datafiles meaningful names, create readmes in the
    directories and make sure there is metadata, in a human readable form
    within any new datafile structures I create.
  
  - I will share my experiences with others, both good and bad. (The
    newsletter being a good place to start.)
  
  - I will test my code to make sure the answers are valid whenever I
    change anything, like compiler options, system etc.
  
  - I will take some training, or at the very least tell people what
    training I feel I need. Or if I have students working for me I will
    make sure they take training or specify anything they might need.

  - I will try better methods of working and not just blindly use the
    computer as a bigger hammer.




Q: When building cvs on AIX, the 'make check' fails.

  Tracking it down, the failure comes from a difference in the expr command:
  
  aix$   expr 'Name: $.' : "Name: \$\." ; echo $?
    0
    1
  
  irix$  expr 'Name: $.' : "Name: \$\." ; echo $?
    8
    0
  
  It works "correctly" on Irix, Unicos, Tru64 and Linux.  It fails on AIX
  and Solaris.  The above works on both Solaris and AIX if one adds a
  space between the '$' and the '.' (on both sides of the ':').
  
  Anybody else ever run into this?  Any suggestions?
 

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.