ARSC HPC Users' Newsletter 219, May 4, 2001
The Scalable Modeling System: A High-Level Alternative to MPI
[[ Thanks to Mark Govett of NOAA for this submission. Note that SMS has been ported to the Cray T3E. ARSC users should contact email@example.com if they are interested in trying it. ]]
M. GOVETT, J. MIDDLECOFF, L. HART, T. HENDERSON, AND D. SCHAFFER
NOAA/OAR/Forecast Systems Laboratory
325 Broadway, Boulder, Colorado 80305-3328 USA
Email: firstname.lastname@example.org
The National Oceanic and Atmospheric Administration's (NOAA's) Forecast Systems Laboratory (FSL) has been using MPP systems to run weather and climate models since 1990. Central to FSL's success with MPPs is the development of the Scalable Modeling System (SMS). SMS is a directive-based parallelization tool that translates Fortran code into a parallel version that runs efficiently on both shared and distributed memory systems including IBM SP2, Cray T3E, SGI Origin, Fujitsu VPP, and Alpha Linux clusters. SMS was designed to reduce the time required to parallelize serial codes and to maintain them.
FSL has parallelized several weather and ocean models using SMS, including the Global Forecast System (GFS) and the Typhoon Forecast System (TFS) for the Central Weather Bureau in Taiwan, the Rutgers University Regional Ocean Modeling System (ROMS), the National Centers for Environmental Prediction (NCEP) 32 km Eta model, FSL's high resolution limited area Quasi Non-hydrostatic (QNH) model, and FSL's 40 km Rapid Update Cycle (RUC) model currently running operationally at NCEP. These models contain structured regular grids that are resolved using either finite difference approximation or Gauss-Legendre spectral methods. SMS also provides support for mesh refinement, and can transform data between grids that have been decomposed differently (e.g., grid and spectral space). While the tool has been tailored toward finite difference approximation and spectral weather and climate models, the approach is sufficiently general to be applied to other structured grid codes.
SMS consists of two layers built on top of the Message Passing Interface (MPI) software. The highest layer is a component called the Parallel Pre-Processor (PPP), which is a Fortran code analysis and translation tool that translates the directives and serial code into a parallel version of the code. This parallel code relies on lower layer SMS libraries to accomplish operations including inter-process communication, process synchronization, and parallel I/O.
SMS directives are available to handle parallel operations including data decomposition, communication, local and global address translation, I/O, spectral transformations and nesting. As the development of SMS has matured, the time and effort required to parallelize codes for MPPs has been reduced significantly. Code parallelization has become simpler because SMS provides support for advanced operations including incremental parallelization and debugging.
Incremental Parallelization - Since code parallelization is easier to test and debug when it can be done in a step-wise fashion, SMS allows the user to insert directives to control when sections of code will be executed serially. This speeds parallel code development by allowing the user to test the correctness of parallelization during intermediate steps. Once assured of correct results, the user can remove these serial regions and further parallelize the code.
Debugging Support - SMS debug directives have been developed that significantly streamline model parallelization, reduce debugging time, and simplify code maintenance. These directives are used to verify that halo (ghost) region values are up to date, and to compare model variables for runs using different numbers of processors. If differences are found, SMS displays the names of the variables, the array locations (e.g., the i, j, k indices), and the corresponding values from each run, and then terminates execution. The ability to compare intermediate model values anywhere in the code has proven to be a powerful debugging tool during code parallelization. The effort required to debug and test a recent code parallelization was reduced from an estimated eight weeks to two, simply because the programmer did not have to spend inordinate amounts of time determining where the parallelization mistakes were made.
SMS provides a number of performance optimizations. The SMS run-time libraries have been optimized to speed inter-processor communications using techniques such as aggregation. Array aggregation permits multiple model variables to be combined into a single communication call to reduce message-passing latency. SMS also allows the user to perform computations in the halo region to reduce communications. SMS also provides high-performance I/O. Since atmospheric models typically output forecasts several times during a model run, SMS can output these data asynchronously with model execution. This optimization can lead to significantly faster execution times.
Finally, SMS provides a variety of performance tuning options accessible at run-time via environment variables. These options can be used to configure the layout of processors to the problem domain, to designate the number of processors used to output decomposed arrays to disk, and to designate when decomposed arrays will be gathered and written to a single file, or written to separate output files by each process.
Recently, a performance comparison was done between the hand-coded MPI based version of the Eta model running operationally at NCEP on 88 processors, and the same Eta model parallelized using SMS. The MPI Eta model was considered a good candidate for fair comparison since it is an operational model used to produce daily weather forecasts for the U.S. National Weather Service and has been optimized for high performance on the IBM SP2. Fewer than 200 directives were added to the 19,000 line Eta model during SMS parallelization. Results of this study indicate nearly identical performance of these models on NCEP's IBM SP2. Additional performance comparisons are planned on other MPP systems.
The reader is encouraged to download SMS and use this freely available software. More information about SMS including software and documentation is available at:
Copper Mountain Multigrid Meeting
The notes from the most recent of the Copper Mountain Multigrid meetings are now available on the web at:
The new materials available include two tutorials, one on the basics of multigrid solvers and another on cache based algorithms. Both are good introductions on how to improve your algorithms and programs to get better performance, both computationally and numerically.
Papers presented at the conference cover a broad range of topics, from theoretical consideration of the multigrid algorithm and example applications to implementation issues specific to particular platforms. One paper also describes "Funding Opportunities in Computational and Numerical Mathematics at the NSF."
Maintaining Multiple Kerberos Caches
Several ARSC users and staff authenticate at more than one site, perhaps at ERDC and ARSC. This creates the problem that kinit overwrites the existing Kerberos ticket each time. Going back and forth with ktelnet means running kinit over and over.
One solution is to maintain separate ticket cache files for each site.
kinit checks the value of the environment variable "KRB5CCNAME" for the name of the cache to use, and thus, by switching it, you can easily maintain multiple caches.
Here's an example of shell aliases that simplify switching between two ticket caches. (These are for the korn shell, but csh users should be able to modify them.)
  alias \
    ccerdc='export KRB5CCNAME=/tmp/krb5_erdc' \
    ccarsc='export KRB5CCNAME=/tmp/krb5cc_xdm_0' \
    cclist='CCTEMP=$KRB5CCNAME;ccarsc;print "\n*****\n";klist;print "\n*****\n";ccerdc;klist;KRB5CCNAME=$CCTEMP;print "\n***** Current: $KRB5CCNAME\n"'
To use, type "ccerdc" before you kinit at ERDC, and then again before trying to ktelnet to ERDC (if there's been an intervening "ccarsc"). Similarly, type "ccarsc" before you kinit at ARSC. "cclist" reports the status of both caches and tells you which is current.
  sgi$ ccerdc
  sgi$ kinit nameuser@WES.HPC.MIL
  sgi$ ccarsc
  sgi$ kinit username@ARSC.EDU
  sgi$ krlogin somehost.arsc.edu
  sgi$ ccerdc
  sgi$ krlogin -l nameuser somehost.wes.hpc.mil
  sgi$ ...
  sgi$ ... etc...
  sgi$ ...
  sgi$ cclist

  *****

  Ticket cache: /tmp/krb5cc_xdm_0
  Default principal: username@ARSC.EDU

  Valid starting      Expires             Service principal
  05/03/01 12:31:32   05/03/01 22:31:22   krbtgt/ARSC.EDU@ARSC.EDU
  05/03/01 12:45:56   05/03/01 22:31:22   host/chilkoot.arsc.edu@ARSC.EDU

  *****

  Ticket cache: /tmp/krb5_erdc
  Default principal: nameuser@WES.HPC.MIL

  Valid starting      Expires             Service principal
  05/03/01 13:27:44   05/03/01 23:27:29   krbtgt/WES.HPC.MIL@WES.HPC.MIL

  ***** Current: /tmp/krb5_erdc
Note that if you lock your local workstation screen, you'll need to authenticate with the current site, whichever it is, to get back in.
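The aliases above use one alias per site. As a minimal sketch of the same idea, assuming a POSIX-compatible shell such as ksh, a single function can take the site as an argument. The function name "ccuse" and the cache-file naming scheme are hypothetical, chosen only for illustration; the one real piece is the KRB5CCNAME variable described above.

```shell
# Hypothetical helper: select a per-site Kerberos ticket cache.
# The name "ccuse" and the cache-file naming scheme are assumptions
# made for this sketch; only KRB5CCNAME comes from Kerberos itself.
ccuse() {
    if [ $# -ne 1 ]; then
        echo "usage: ccuse <site>" >&2
        return 1
    fi
    # One cache file per site, keyed on the numeric user id.
    KRB5CCNAME="/tmp/krb5cc_$(id -u)_$1"
    export KRB5CCNAME
    echo "Current ticket cache: $KRB5CCNAME"
}

ccuse arsc    # then: kinit username@ARSC.EDU
ccuse erdc    # then: kinit nameuser@WES.HPC.MIL
```

As with the aliases, run "ccuse <site>" before kinit and again before connecting to that site, so that the clients find the matching ticket.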
Quick-Tip Q & A
A:[[ My Fortran code reads from an unformatted input file (originally
  [[ written on a Cray T3E). Now I'm attempting to run it on a Cray SV1
  [[ and it stops with,
  [[
  [[   "A READ operation tried to read past the end-of-file"
  [[
  [[ What should I do next? What's wrong?
Unformatted files can be faster, and (often, though certainly not always) smaller than formatted files. However, their biggest drawback is nonportability. The term "unformatted" is a bit misleading. When you specify unformatted, the values aren't "without format", they're formatted in a very specific way. You're basically telling the machine to output numbers with a minimum of alterations from their native binary images. Thus, even machines from the same vendor can have different "unformatted" formats -- as do the T3E and SV1.
The important thing here, however, is that even while the "unformatted" formats are different between the machines, this won't necessarily produce a run-time error.
In many cases a read statement can happily load the strings of bits into variables, producing something entirely incorrect. Thus, if the number of records is read in from a header, it could contain an absurd value, causing the loop not to terminate when it should and resulting in the error message above.
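How a header value can turn absurd without any error is easy to show with shell arithmetic. The sketch below decodes the same four bytes under two byte orders. Byte order is only an assumed example of a format mismatch; it is not claimed to be the actual difference between T3E and SV1 unformatted files, and the variable names are hypothetical.

```shell
# Illustration only: the same four bytes decoded under two byte orders.
# Byte order is an assumed example of a format mismatch, not a statement
# about the real T3E and SV1 file formats.
b0=0 b1=0 b2=0 b3=8    # header bytes, in the order they appear in the file

as_big_endian=$(( (b0 << 24) | (b1 << 16) | (b2 << 8) | b3 ))
as_little_endian=$(( (b3 << 24) | (b2 << 16) | (b1 << 8) | b0 ))

echo "record count as written: $as_big_endian"
echo "record count as misread: $as_little_endian"
```

The writer stored a record count of 8; the mismatched reader sees 134217728, loops far past the real data, and eventually runs off the end of the file.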
In general, unformatted files are only recommended in cases where I/O is a bottleneck to performance. If large-scale I/O only occurs in a pre- or post-processing stage and portability (for the data or the code) is desired, formatted I/O should be used.
That said, the Cray "assign" command for Fortran programs, or FFIO for C programs, can be used to convert many unformatted files at run time. For more, see "man assign", the article on Safer Assignment in issue 192 ( /arsc/support/news/t3enews/t3enews192/index.xml ), or "man intro_ffio".

Q: I wanted to share one data file with my group, so I did this:

     chmod g+r ~/FunnelWort/19930201/data001

   I know, I must also chmod all subdirectories in the path. This is a
   pain. I often forget this step, and the only way I know to do it is
   one directory at a time:

     chmod g+rX ~/FunnelWort/19930201/
     chmod g+rX ~/FunnelWort
     chmod g+rX ~

   (Using "chmod -R" is not an option, as I don't want to share
   everything.)

   Is there a better way to give group access to a file down the
   directory tree?
[[ Answers, Questions, and Tips Graciously Accepted ]]
Ed Kornkven
ARSC HPC Specialist
ph: 907-450-8669

Kate Hedstrom
ARSC Oceanographic Specialist
ph: 907-450-8678

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.