ARSC HPC Users' Newsletter 271, June 27, 2003

SX-6 Status

The SX-6 was installed a year ago at ARSC. It will remain here for one more year, until June 2004.

The SX-6 is available, by application only, to the U.S. HPC community for testing and benchmarking of codes. It's intended for research, not production. In the United States, this remains a unique opportunity.

For practical purposes, we recommend you not delay too long. A rush of users at the end of its tenure could make scheduling of dedicated runs more difficult and otherwise stress the system.

More information:

  • ARSC's primary SX-6 page, with links to application forms, statement of purpose, etc.:

http://www.arsc.edu/support/news/SX_6update2.html

  • For current and potential SX-6 users, here's info on compilers, profilers, performance analysis, optimization, and just getting started:

http://www.arsc.edu/support/howtos/usingsx6.html

  • For porting and performance case studies, as well as user experiences of the SX-6, download the PDF file: "SX-6 Comparisons and Contrasts" from:

http://www.arsc.edu/support/technical.html

More questions? Contact: consult@arsc.edu

Technical Papers and Reports

We've updated our web site with the following recent research papers by ARSC staff:
  • Portable Cray Bioinformatics Library, J. Long, ARSC, Proceedings of the Cray User Group, May 2003

  • The ARSC Storage Solution, G. McGill, ARSC, Proceedings of the Cray User Group, May 2003

  • SX-6 Comparisons and Contrasts, T. Baring, ARSC, Proceedings of the Cray User Group, May 2003

Go to:

http://www.arsc.edu/support/technical.html

Performance of ROMS on Regatta and SX-6

[ ARSC hosted two Air Force Academy Cadets as interns earlier this summer. Our thanks to Cadet Ryan Roper for this report. ]

ROMS Benchmark: Comparing Processor Counts and Tilings on the IBM Regatta and Cray SX-6

Comparing ROMS performance on the SX-6 and the Regatta has been an interesting project during my brief time at ARSC. My task was to find the best way to run the MPI build of the Regional Ocean Modeling System (ROMS) for 200 timesteps, determine which ARSC HPC system runs it fastest, and compare that system to the runners-up. ROMS is an ocean simulation managed at ARSC by oceanographer Kate Hedstrom; for this benchmark, it models an idealized square portion of a southern ocean. Before coming to ARSC, I had no exposure to supercomputing and limited UNIX experience (which served me well once I finally remembered it), and this project was an excellent introduction.

From my tests, the ARSC host best suited to this ROMS benchmark is the IBM Regatta (known here as Iceflyer). I had the fewest compilation problems and the fastest absolute run times on that machine. A "tiling" is the way that the data is distributed across the processors. In all, I made 13 runs, varying the number of processors and the tiling, to test the hypothesis that tilings closer to squares (2x2, 3x3, etc.) would run faster. While larger numbers of processors greatly reduced the running time, the results were somewhat erratic, and the square tiling did not always produce the best results.
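One common rationale for the square-tiling hypothesis is that, for a fixed number of grid points per tile, a squarer tile has a shorter perimeter and therefore fewer boundary points to exchange with its neighbors. The Python sketch below illustrates only that idea. The 512x512 grid size and the function names are hypothetical; they are not taken from ROMS or from these benchmark runs.

```python
def tilings(nprocs):
    """Return every (a, b) pair of positive integers with a * b == nprocs."""
    return [(a, nprocs // a) for a in range(1, nprocs + 1) if nprocs % a == 0]


def boundary_points(grid_x, grid_y, tile):
    """Approximate perimeter, in grid points, of one tile of an a x b tiling."""
    a, b = tile
    return 2 * (grid_x / a + grid_y / b)


# Hypothetical grid size, chosen only for illustration.
GRID_X, GRID_Y = 512, 512

# List the possible tilings for 16 processors, squarest (shortest perimeter) first.
for tile in sorted(tilings(16), key=lambda t: boundary_points(GRID_X, GRID_Y, t)):
    points = boundary_points(GRID_X, GRID_Y, tile)
    print(f"{tile[0]}x{tile[1]}: about {points:.0f} boundary points per tile")
```

This counts perimeter only. It ignores machine load, cache behavior, and vector length, any of which may matter more in practice.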

My guess is that a square tiling may be ideal, but one would need the machine to oneself to test this. (The machine was shared with other batch users during these runs.) My recommendation for running ROMS on the Regatta is to use as many processors as you can get (I found that 20 at 5x4 was best) and to choose a tiling whose two dimensions are reasonably close together. Unfortunately, f1n1, the Regatta's main batch node, has only 24 processors; otherwise a 5x5 tiling might have been ideal.


   Num
  Procs   Tiling  Running Time
  -----   ------  ------------
    1       1x1     1458.32
    2       2x1     689.83
    4       1x4     311.17
    6       3x2     254.08
    9       3x3     140.46
    12      3x4     107.82
    16      8x2     84.43
    20      5x4     71.44
Table 1.  Best case total running time on the Regatta
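For readers who want to turn the Regatta timings above into scaling figures, here is a minimal Python sketch. It computes speedup (single-processor time divided by N-processor time) and parallel efficiency (speedup divided by N). The numbers are copied from the table; the function name `scaling` is just an illustration.

```python
# (processors, tiling, best-case running time) copied from the Regatta table.
regatta = [
    (1, "1x1", 1458.32),
    (2, "2x1", 689.83),
    (4, "1x4", 311.17),
    (6, "3x2", 254.08),
    (9, "3x3", 140.46),
    (12, "3x4", 107.82),
    (16, "8x2", 84.43),
    (20, "5x4", 71.44),
]


def scaling(rows):
    """Return (procs, tiling, speedup, efficiency) relative to the first row."""
    base_time = rows[0][2]
    return [(n, tiling, base_time / t, base_time / t / n) for n, tiling, t in rows]


print(f"{'Procs':>5}  {'Tiling':>6}  {'Speedup':>7}  {'Efficiency':>10}")
for n, tiling, speedup, efficiency in scaling(regatta):
    print(f"{n:>5}  {tiling:>6}  {speedup:>7.2f}  {efficiency:>10.2f}")
```

Because the runs shared the machine with other batch jobs, these ratios should be read as rough indicators rather than clean scaling measurements.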

The SX-6 was no less busy than the Regatta, but a clear pattern emerged in the most efficient tilings. From my tests, the best way to run a ROMS job on the SX-6 is to tile 1xN, where N is the number of processors. This is because long, skinny tiles make the most efficient use of the SX-6's long, 256-element vector registers. The SX-6 at ARSC has 8 processors in a single shared-memory cabinet.
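To see why tile shape matters for vector length, consider this rough Python sketch. It rests on two assumptions that the article does not state: the first number in a tiling splits the inner (vectorized) grid dimension, and the grid is 256 points long in that dimension. Only the 256-element register length comes from the text.

```python
VECTOR_LENGTH = 256  # SX-6 vector register length, from the article


def inner_loop_length(grid_i, tiles_i):
    """Points per tile along the inner dimension, rounded up."""
    return -(-grid_i // tiles_i)


def vector_fill(loop_len, vlen=VECTOR_LENGTH):
    """Fraction of register slots used when loop_len is processed in chunks of vlen."""
    chunks = -(-loop_len // vlen)
    return loop_len / (chunks * vlen)


GRID_I = 256  # hypothetical inner-dimension size, for illustration only

for tiles_i, tiles_j in [(1, 4), (2, 2), (4, 1)]:
    length = inner_loop_length(GRID_I, tiles_i)
    fill = vector_fill(length)
    print(f"{tiles_i}x{tiles_j}: inner loop length {length}, vector fill {fill:.2f}")
```

Under these assumptions, a 1xN tiling leaves the inner dimension unsplit, so each vector operation can fill a whole register. Splitting that dimension shortens the loops and leaves part of each register idle.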

Interestingly, the single- and dual-processor runs on the SX-6 were faster than those on the Regatta. As the processor count increased, the Regatta's performance more closely matched the SX-6's until, by sheer availability of processors, the Regatta pulled away. Note that the SX-6 time spiked at 6 processors. I would like to investigate that in more depth, but a rough re-run suggests the cause was heavier usage of the machine: the re-run went through a less crowded batch queue and produced a time more in line with expected results. It would be interesting to see how the SX-6 compared with the Regatta if it had more processors available. It certainly is a capable machine.


   Num
  Procs   Tiling  Running Time
  -----   ------  ------------
    1       1x1     1159.859
    2       1x2     596.588
    3       1x3     416.795
    4       1x4     321.181
    5       1x5     261.259
    6       1x6     272.898
    7       1x7     199.577
Table 2.  Best case total running time on the SX-6
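The two tables above share four processor counts (1, 2, 4, and 6). As a quick check on the comparison, this Python sketch divides the Regatta time by the SX-6 time at each shared count, so a ratio above 1.0 means the SX-6 run was faster. The timings are copied from the tables; the helper name `time_ratios` is just an illustration.

```python
# Best-case running times keyed by processor count, copied from the two tables.
regatta = {1: 1458.32, 2: 689.83, 4: 311.17, 6: 254.08,
           9: 140.46, 12: 107.82, 16: 84.43, 20: 71.44}
sx6 = {1: 1159.859, 2: 596.588, 3: 416.795, 4: 321.181,
       5: 261.259, 6: 272.898, 7: 199.577}


def time_ratios(a, b):
    """Return {procs: a_time / b_time} for processor counts present in both."""
    return {n: a[n] / b[n] for n in sorted(a.keys() & b.keys())}


for procs, ratio in time_ratios(regatta, sx6).items():
    faster = "SX-6" if ratio > 1.0 else "Regatta"
    print(f"{procs} procs: Regatta/SX-6 time ratio {ratio:.2f} ({faster} faster)")
```

Both machines were shared during these runs, so small differences between the ratios may be noise.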

The following graph summarizes the timings. It shows how similarly the code performs on the Regatta and SX-6 on 1-7 processors, and the advantage of additional processors on the Regatta:

Figure 1. Regatta and SX-6 ROMS scaling

In conclusion, the Regatta is the best ARSC system for running ROMS. I did additional tests on the Cray SV1ex and the Cray T3E, but I had more than my share of difficulties getting the MPI and OpenMP versions to even compile, let alone run. Once they were running, they performed nowhere near as well as the Regatta had, even with large numbers of processors. I was impressed by the SX-6's performance, and I'm sure it would be well suited to running ROMS if it weren't for its limited number of CPUs.

Ryan L. Roper USAF Academy Class of 2004

Quick-Tip Q & A



A:[[ Here's one person's definition of "code-blindness", grabbed off the
  [[  web:
  [[
  [[   "... the inability to actually work out what on earth your code is
  [[   doing, even though you were wholly responsible for it..."
  [[
  [[ Programmers: do you have a technique for snapping yourself out of
  [[ code-blindness, or avoiding it in the first place?



  Here are a few ideas from the editors:

  - Explain the problem to someone else.  It's amazing how often you'll
    see your own problem in a new light when you attempt to explain it.

  - Re-document, and even over-document, the code. This can force you
    to rethink it and understand it from the algorithms down to the
    names of variables.

  - Revise it (or start) with better variable names.  Mix case.


  [ Feel free to respond late to this question.  We'd like to expand 
    the above list with your experience. ]




Q: Is there an easy way to extract a column from a regular text file?
   For instance, a column of data.  Or am I back to writing a perl
   script?

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.