ARSC HPC Users' Newsletter 280, October 24, 2003
A Tale of Porting to the Cray X1
[ Thanks to Kate Hedstrom of ARSC for this article. ]
My primary application is an ocean model called ROMS (the Regional Ocean Modeling System). Before I can even try to compile it, I need to have access to the NetCDF library. I have compiled NetCDF several times over the years, but I am by no means an expert on it. It is usually very easy to compile the C part, with the f77, f90 and C++ parts being trickier. On the X1, I had some trouble with the C part because of one of those "#ifdef cray, do something special; #else, do the default" constructs. On the X1, you want to be using the default POSIX code. Beyond that, I accepted the Cray-compiled NetCDF library.
Back to ROMS, the first thing to do is to find out what arguments to give selected_real_kind for real*4, real*8, and real*16. (See an article devoted to "kind" issues in newsletter #263.) Our code is written to use the r8 kind for 64-bit reals in most of the computations, including literal constants. Since we are managing our own kinds, I didn't feel that Cray ftn's "-s default64" option would be necessary.
With the "kind" issue settled, I went ahead and compiled and ran a serial job. The only change I had to make to the code was to work around a missing getpid function (informational, not really necessary). The job ran fine. With the serial code running, I went ahead and tried to run an MPI job. It died pretty quickly inside the MPI message passing. It was another #ifdef CRAY problem regarding the size of default reals. It cropped up in the size of messages to pass, and also in the writing of NetCDF files.
The moral of the story is to search out those special Cray cases, assuming your code currently runs on both Cray and other flavors of Unix.
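One way to hunt for those special cases is to list every preprocessor conditional that mentions "cray" and review each hit by hand. The following is only a rough sketch: the function name and the SRC_DIR variable are made up for illustration, and it assumes a grep that supports recursive search (-r), as GNU and BSD grep do.

```shell
#!/bin/sh
# Sketch: list preprocessor conditionals that mention "cray" in any letter
# case, so each one can be reviewed before porting.
# find_cray_ifdefs and SRC_DIR are illustrative names, not part of ROMS.
find_cray_ifdefs() {
    # -r recurse, -n show line numbers, -i ignore case, -E extended regex
    grep -rniE '^[[:space:]]*#[[:space:]]*(if|ifdef|ifndef|elif).*cray' "$1"
}

# Default to the current directory; point SRC_DIR at your own source tree.
SRC_DIR=${SRC_DIR:-.}
find_cray_ifdefs "$SRC_DIR" || echo "no Cray-specific conditionals found in $SRC_DIR"
```

Each hit still has to be read in context. Some Cray branches may remain correct on the X1, while others (like the default real size assumption described above) will not.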
Once the code runs and results have been verified for correctness, the question of performance comes up. We have a standard ROMS case needing no input files but including most of the complicated physics of our more realistic domains. A number of obvious interest is the run time on the IBM Power4 system (iceflyer).
ROMS has its own built-in timing to find the relative costs of the major model components. The IBM times vary depending on the other jobs in the system, the best achieved being 1020 seconds, broken out as follows:
    Model 2D kernel ..........................   237.580  (23.5765 %)
    KPP vertical mixing parameterization .....   176.150  (17.4804 %)
    3D equations predictor step ..............   113.330  (11.2464 %)
    Atmospheric boundary layer coupler .......   103.560  (10.2769 %)

plus smaller stuff. The Mflips rating on this is:
    Flip rate (flips / WCT)  : 521.414 Mflip/sec
    Flips / user time        : 530.482 Mflip/sec

which is exactly 10% of peak, peak being 5.2 Gflip/sec on this 1.3 GHz power4 processor.
The Cray timing for the original ROMS code is 2510 seconds, broken out as follows:
    Model 2D kernel ..........................    38.333  ( 1.5286 %)
    KPP vertical mixing parameterization .....   855.806  (34.1271 %)
    3D equations predictor step ..............    15.775  ( 0.6291 %)
    Atmospheric boundary layer coupler .......  1531.398  (61.0678 %)
Obviously, some parts of the code are faster on the X1, but some parts are vastly slower. About 95% of the time is spent in two vertical mixing routines that didn't vectorize particularly well. The first of these routines was sped up by inlining one of its constituent functions: adding "-O inlinefrom=lmd_wscale.f90" to the compile line for the two routines which use it. For some loops, inlining will enable vectorization which would otherwise be inhibited by procedure and function calls.
This extra option cannot be added to the compile for all the files because the function we are inlining uses f90 modules which have to be compiled before the function can be inlined.
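The ordering constraint can be sketched as a small dry-run build script. This is not the ROMS makefile: mod_example.f90, caller_one.f90 and caller_two.f90 are hypothetical placeholder names, and only lmd_wscale.f90 and the inlinefrom option come from the text above. With DRY_RUN=1 (the default) the script prints the commands rather than invoking the compiler.

```shell
#!/bin/sh
# Sketch of the build order only. Every file name except lmd_wscale.f90 is a
# hypothetical placeholder. DRY_RUN=1 prints commands instead of running them.
FC=${FC:-ftn}
DRY_RUN=${DRY_RUN:-1}

compile() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$FC $*"
    else
        "$FC" "$@"
    fi
}

build_in_order() {
    # 1. Modules used by lmd_wscale.f90 come first, without the inline option.
    compile -c mod_example.f90
    # 2. The function to be inlined is itself compiled normally.
    compile -c lmd_wscale.f90
    # 3. Only the two callers get the extra option.
    for caller in caller_one.f90 caller_two.f90; do
        compile -O inlinefrom=lmd_wscale.f90 -c "$caller"
    done
}

build_in_order
```

Keeping the option on just the callers avoids asking the compiler to inline from a file whose modules have not been built yet.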
With inlining, the X1 time dropped to 1652 seconds:
    Model 2D kernel ..........................    38.531  ( 2.3333 %)
    KPP vertical mixing parameterization .....    27.370  ( 1.6575 %)
    3D equations predictor step ..............    15.783  ( 0.9558 %)
    Atmospheric boundary layer coupler .......  1502.950  (91.0149 %)

It is pretty clear that this bulk_flux atmospheric boundary layer code needed some attention. A specialist from Cray and I rewrote the code to further improve vectorization and clean up some unnecessarily weird logic in it. The final X1 timing was 171 seconds:
    Model 2D kernel ..........................    38.004  (22.3496 %)
    KPP vertical mixing parameterization .....    27.273  (16.0387 %)
    3D equations predictor step ..............    15.648  ( 9.2024 %)
    Atmospheric boundary layer coupler .......    23.059  (13.5604 %)

The answer was even the same! I never imagined it would go so fast. The flop rating was:
    Total FP ops      2679.144M/sec      452281560816 ops

which is 20% of peak for one Cray X1 MSP (which has a peak theoretical performance of 12.8 Gflops).
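For anyone who wants to check the arithmetic, the percent-of-peak and speedup figures can be reproduced with a few lines of awk. The function names below are my own, and the inputs are simply the rates and run times quoted in this article.

```shell
#!/bin/sh
# Percent of theoretical peak: achieved Mflop/s over peak Mflop/s, times 100.
percent_of_peak() {
    awk -v achieved="$1" -v peak="$2" 'BEGIN { printf "%.1f\n", 100 * achieved / peak }'
}

# Speedup: earlier run time divided by later run time (both in seconds).
speedup() {
    awk -v t_old="$1" -v t_new="$2" 'BEGIN { printf "%.1f\n", t_old / t_new }'
}

echo "IBM Power4:  $(percent_of_peak 521.414 5200)% of peak"
echo "Cray X1 MSP: $(percent_of_peak 2679.144 12800)% of peak"
echo "X1 original vs. final X1 time: $(speedup 2510 171)x"
echo "Best IBM time vs. final X1 time: $(speedup 1020 171)x"
```

This prints 10.0% and 20.9% of peak, a 14.7x improvement over the original X1 run, and roughly a 6.0x advantage over the best IBM time.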
MSP vs. SSP Experiment
All timings shown above for the X1 are for one MSP, or one multi-streaming processor. Each MSP is actually composed of four SSPs, or single-streaming processors, which share cache and have other hardware and software support (multi-streaming) to make them behave as one processor. SSPs can, however, be accessed individually.
Early on, I ran some timings to assess SSP vs. MSP performance and found that four SSPs were about twice as fast as one MSP. However, 16 MSPs were vastly faster than 64 SSPs. In fact, I couldn't get the 64 SSP run to finish; it kept timing out. With the X1 optimized version of ROMS, I sincerely doubt that I could get the four SSP run to be faster than one MSP, now that the compiler is able to vectorize and multistream both vertical mixing routines.
Butrovich Police Blotter
Residents of the UAF Butrovich building (which includes ARSC) received this actual email about a week ago. I've only changed the license numbers...
>
> It has been brought to my attention that there was a small accident
> in the parking lot this morning.
>
> A silver Subaru sedan, license plate #nn-nnn apparently started
> rolling backwards in the parking lot, gained momentum and slammed into
> a parked white GMC truck, license plate #nn-nnn.  Neither vehicle
> was occupied at the time.
>
> These vehicles need to be moved immediately.
>
> Thanks!
>
[ The vehicles seem to have been removed... and, no, the police blotter is not expected to become a regular Newsletter feature.]
Quick-Tip Q & A
A:[[ Special characters for egrep include "$" (match end of line) and
  [[ "^" (match beginning of line).  I had a file containing dollar
  [[ signs ("$") at the beginning of many lines, and of course, these were
  [[ the lines I wanted to extract with egrep.
  [[
  [[ By trial and error, I discovered I had to DOUBLE escape the dollar
  [[ signs.  My curiosity aroused and my day already shot, I then
  [[ discovered that to extract lines beginning with carets ("^"), I would
  [[ only have to escape the caret once.  Like this:
  [[
  [[   mywkstn> cat test.txt
  [[   wwwww
  [[   $ xxx
  [[   yyy $
  [[   ^ zzz
  [[   000 ^
  [[   11111
  [[   mywkstn> egrep "^\\$" test.txt
  [[   $ xxx
  [[   mywkstn> egrep "^\^" test.txt
  [[   ^ zzz
  [[   mywkstn>
  [[
  [[ Am I cursed?  Or is this rational behavior which someone can explain?

  #
  # Thanks to Martin Luthi:
  #

  This is one of the fine distinctions of the different types of quotes.
  For example in the Bourne shell:

    'xxx'   disable all special characters in xxx
    "xxx"   disable all special characters in xxx except $, `, and \.
    \x      disable the special meaning of character x

  so

    mywkstn> egrep '^\$' test.txt
    mywkstn> egrep '^\^' test.txt

  gives you the behaviour that you want.

  #
  # Thanks to Rich Griswold:
  #

  A few examples using echo should explain what is going on:

    ~> echo "$"
    $
    ~> echo "\$"
    $
    ~> echo "\\$"
    \$
    ~> echo "^"
    ^
    ~> echo "\^"
    \^
    ~> echo "\\^"
    \^

  Shells perform expansion of special characters inside double quotes,
  so \$ gets turned into $, but \\$ gets turned into \$.  This is
  because a dollar sign is a special character (used for shell
  variables), so escaping it with a backslash gets a plain dollar sign.
  Since the backslash is also a special character (used to escape other
  special characters), you need two of them to get a single backslash.

  Since a caret is not a special character, escaping it with a backslash
  results in \^.  However, if you choose to escape the backslash with
  another backslash, that will turn into a single backslash.

  Confused yet?  In summary:

    "$"    ->  $
    "\$"   ->  $
    "\\$"  ->  \$
    "^"    ->  ^
    "\^"   ->  \^
    "\\^"  ->  \^

  These examples were done using bash, but the results should be
  similar for sh and ksh.

  To save some confusion, you can use single quotes so that the shell
  does not do any expansion of special characters.  What you see is
  what you get:

    '$'    ->  $
    '\$'   ->  \$
    '\\$'  ->  \\$
    '^'    ->  ^
    '\^'   ->  \^
    '\\^'  ->  \\^

  In closing, if you are wondering what the difference between "$" and
  "\$" (and "\\$", and "\\\$", ...) is, this should clear it up:

    ~> echo "$HOME"
    /home/richard
    ~> echo "\$HOME"
    $HOME
    ~> echo "\\$HOME"
    \/home/richard
    ~> echo "\\\$HOME"
    \$HOME

  #
  # And thanks to Greg Newby:
  #

  You are probably not cursed.  Find a good tutorial on Regular
  Expressions on the Internet or in a shell script book for a good
  (though not necessarily illuminating) treatment of this.

  The short answer is that the dollar sign is how your Unix shell
  identifies a variable, but it's also a regular expression for the end
  of line.  Some shells interpret the caret as a substitution symbol
  for command-line editing (tcsh, zsh and others).  The caret is also a
  regular expression symbol for the start of line.

  The trick you discovered with escapes (which could also work with
  quotation marks) is a way of telling the shell whether to interpret
  these special characters before passing them on to the grep command
  as arguments.  Then, if they're not already interpreted, grep gets to
  decide whether to interpret them as regular expressions or as simple
  strings.

  That is the short answer.  Regular expressions are very powerful, but
  pretty complex, too.  There's even a special version of grep,
  "egrep", to make more powerful use of them than standard "grep."

Q: Here are the first four lines of a file that just goes on and on:

     USER      DATE      CMDS       REAL       SYS       USER
     jimbob    ALL     2267.0   674011.6    2113.5   258037.0
     bobbob    ALL      109.0     1335.0     570.8       98.2
     amysue    ALL       58.0  1223863.9    3003.7  1186547.6

   The columns remain consistent for the entire file.
I want to sort this on fields like "REAL" and "USER," and thought I could just use Unix "sort"... but the data is delimited by varying numbers of spaces and I can't figure it out. Doing it by hand is taking forever! Can anyone help?
[[ Answers, Questions, and Tips Graciously Accepted ]]
Ed Kornkven
ARSC HPC Specialist
ph: 907-450-8669

Kate Hedstrom
ARSC Oceanographic Specialist
ph: 907-450-8678

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.