ARSC T3E Users' Newsletter 199, July 7, 2000

ARSC SGI Cluster and the T3E

[ Editor's Note:

Many thanks to Don Morton of the University of Montana, who provided this article.

ARSC is investigating different cluster technologies and has been developing the SGI cluster described below. The ARSC SGI cluster isn't quite ready for general use, but interested users are encouraged to contact us for more information.

Don helped solve and identify several technical issues, and as always, it was great sharing our space with him on his annual migration north. ]

Building Clusters for HPC Education and Development Activities (Or, What I Did on My 2000 Alaska "Vacation")
For the past six years I have spent the academic year at small universities, and have gained exposure to the world of high performance computing by spending time, mostly summers, at the Arctic Region Supercomputing Center. The supercomputing experience gained during those summers went back with me to the Lower 48 for the academic year in the form of teaching and research.

With the evolution of Linux during the 1990's, it became natural to use it on PCs to build low-cost parallel computing environments for continuing research activities and for training students in parallel computing. The key to all of this was to make use of software available in both supercomputing and low-cost cluster environments.

During the last half of June, I came up to Fairbanks to help ARSC configure a cluster of SGI workstations that might ultimately be used as a training platform in parallel computing, and as a launching point for the use of higher-end architectures such as yukon. The primary goal was to try out software that would make a cluster manageable for this task, while offering a user interface somewhat similar to that on the CRAY T3E.

This work was well-supported by ARSC, in particular by Liam Forbes and Dale Clark, who performed the software installations for me.

This article focuses on the use, testing, and installation of MPI on ARSC's SGI cluster, an evolving Linux cluster at UAF's Physics Department, a Solaris cluster at The University of Montana, and ARSC's Cray T3E (yukon).

Native MPI on the SGI Cluster

Message Passing Interface (MPI), a standard which defines widely-used programming interfaces for parallel programming, is a logical tool for achieving portability of programs across a wide range of parallel architectures. With MPI it's quite easy to learn and develop parallel codes on low-cost clusters and then move them to supercomputers without any modification. So, MPI is a logical first step towards achieving common environments between low-cost clusters and supercomputers.
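To make the examples below concrete, here is a minimal MPI "Hello World" in C. This is only an illustrative sketch, not a listing of the actual pHello code used in the runs that follow, but it produces the same sort of output: each processing element reports its rank and the host it is running on.

--------------------------------------------------------------
/* Minimal MPI "Hello World" -- an illustrative sketch only */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int  rank, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                  /* start up MPI        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which PE am I?      */
    MPI_Get_processor_name(name, &len);      /* which host am I on? */

    printf("PE%d: running on %s\n", rank, name);

    MPI_Finalize();
    return 0;
}
--------------------------------------------------------------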

The SGI cluster, by default, has a native implementation of MPI. This implementation is integrated into the "arrayd" environment, which offers a high-level interface to the cluster environment. For example, one may execute the command "array who" to get a list of the users on the cluster.

"arrayd" is an SGI attempt at creating a single, unified view of the cluster. Though I didn't investigate it fully, it appears to do this to some degree. On the other hand, "arrayd" is not typically found on other cluster platforms, so doesn't fit in with the goal of establishing common environments across parallel platforms.

To run a simple, parallel, Hello World program across the SGI cluster, using the native MPI environment, one must specify the hosts to run on. In my opinion, this forces the user to become a little too "intimate" with the local environment, particularly if the user is migrating between several different systems.

A default run on the designated cluster server would look like:


--------------------------------------------------------------
csrv% CC -o pHello pHello.C -lmpi
csrv% mpirun -np 4 pHello
PE0: running on csrv
PE1: running on csrv
PE2: running on csrv
PE3: running on csrv
csrv% 
--------------------------------------------------------------
To specify particular machines, you first have to know the names of the available machines, then include on the command line each machine name and the number of copies to run on it - for example:

--------------------------------------------------------------
csrv% mpirun host1 1, host2 1, host3 1, csrv 1  pHello
PE3: running on csrv
PE2: running on host1
PE0: running on host2
PE1: running on host3
csrv% 
--------------------------------------------------------------
If you get the command-line syntax just a little wrong, your program simply won't run as expected.

MPICH on the SGI Cluster

Because of the "different" behavior of the native MPI implementation on the SGI cluster, we thought it would be interesting to install the MPICH implementation of MPI, available for a wide range of architectures from Linux clusters to the T3E. In addition to its ease of use and common interface, several important parallel programming software tools (e.g. Totalview Debugger, VAMPIR performance analyzer) are compatible only with the MPICH implementation.

So, MPICH was installed and tested as the default MPI implementation, establishing the same user environment you would encounter on other clusters that use MPICH. Thus, one can move easily between environments without having to worry about a lot of site-specific peculiarities.
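Under MPICH, the compile-and-run cycle looks the same as it would on a Linux or Solaris cluster, and the hosts the processes land on are determined by the MPICH machines file set up at installation rather than by host names on the command line. A hypothetical session (using the C version of the Hello World program, as on yukon, and assuming the MPICH mpicc and mpirun are first in your path) might look like:

--------------------------------------------------------------
csrv% mpicc -o pHello pHello.c
csrv% mpirun -np 4 pHello
--------------------------------------------------------------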

MPI on the Cray T3E (yukon)

There are two MPI implementations on yukon: the default implementation and an MPICH implementation. An example of compiling and executing with the default implementation follows:

--------------------------------------------------------------
yukon% CC -o pHello pHello.C
yukon% mpprun -n 4 pHello
PE1: running on yukon
PE3: running on yukon
PE0: running on yukon
PE2: running on yukon
yukon% 
--------------------------------------------------------------
Naturally, we expect the hostname to be the same on each T3E processor.

To use the MPICH implementation, it is necessary to include header files and link with libraries from /usr/local/pkg/mpich/current. There is no mpiCC script on yukon for compiling C++ MPI programs, but by converting the Hello World program to C, we can compile and run on yukon as:


--------------------------------------------------------------
yukon% /usr/local/pkg/mpich/current/bin/mpicc -o pHello pHello.c
yukon% mpprun -n 4 pHello
PE 1: running on yukon
PE 3: running on yukon
PE 0: running on yukon
PE 2: running on yukon
yukon% 
--------------------------------------------------------------
Presumably, the yukon MPICH distribution will be upgraded to include the mpiCC script for C++ programs, but for now one can compile and link, as alluded to in Newsletter 158, as follows:

--------------------------------------------------------------
yukon% CC -c pHello.C -I/usr/local/pkg/mpich/current/include
yukon% CC -o pHello -L/usr/local/pkg/mpich/current/lib/cray_t3e/t3e -lmpi pHello.o
yukon% mpprun -n 4 pHello
PE2: running on yukon
PE0: running on yukon
PE1: running on yukon
PE3: running on yukon
--------------------------------------------------------------

Some Timings on Various Platforms and Environments

Just for fun, and a demonstration of portability, we'll look at the execution of a parallel Jacobi method for solving dense systems of equations. The program and Makefiles can be accessed at:

http://www.arsc.edu/~morton/NewsSupps/

by clicking on News199Supps.tgz

The program is run like this:


  mpirun -np 4 pJacobi 4000  (MPICH clusters)
  mpprun -n 4 pJacobi 4000   (T3E)
where 4000 is the dimension of the test system of equations (the system is generated within the program).

The sample test run consisted of a very simple system of dimension 4000 (it starts from the identity matrix!).
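As a point of reference, the following is a rough sketch, in C with MPI, of the kind of row-distributed Jacobi iteration pJacobi performs. It is NOT the actual pJacobi source from News199Supps.tgz: the convergence test is omitted, the fixed iteration count and variable names are purely illustrative, and it assumes the dimension divides evenly by the number of processors.

--------------------------------------------------------------
/* Sketch of a row-distributed parallel Jacobi iteration (C + MPI).
   Illustrative only -- not the pJacobi source referenced above. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, np, n, nlocal, i, j, gi, iter;
    double *a, *b, *x, *xnew, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    n      = (argc > 1) ? atoi(argv[1]) : 4000;
    nlocal = n / np;                     /* rows owned by this PE */

    a    = malloc(nlocal * n * sizeof(double));   /* local block of rows */
    b    = malloc(nlocal * sizeof(double));
    x    = malloc(n      * sizeof(double));
    xnew = malloc(nlocal * sizeof(double));

    /* Generate a trivial test system (identity matrix, as in the text). */
    for (i = 0; i < nlocal; i++) {
        gi = rank * nlocal + i;                   /* global row index */
        for (j = 0; j < n; j++)
            a[i*n + j] = (gi == j) ? 1.0 : 0.0;
        b[i] = 1.0;
    }
    for (j = 0; j < n; j++)
        x[j] = 0.0;

    /* Each PE updates its own rows, then all PEs exchange the full
       solution vector.  A real solver would test for convergence. */
    for (iter = 0; iter < 100; iter++) {
        for (i = 0; i < nlocal; i++) {
            gi  = rank * nlocal + i;
            sum = 0.0;
            for (j = 0; j < n; j++)
                if (j != gi)
                    sum += a[i*n + j] * x[j];
            xnew[i] = (b[i] - sum) / a[i*n + gi];
        }
        MPI_Allgather(xnew, nlocal, MPI_DOUBLE,
                      x,    nlocal, MPI_DOUBLE, MPI_COMM_WORLD);
    }

    if (rank == 0)
        printf("x[0] = %f\n", x[0]);

    MPI_Finalize();
    return 0;
}
--------------------------------------------------------------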

The table below gives the wall time (in seconds) required for solving the internally-generated system of equations on the various platforms and environments.


                                  Number of Processors

Platform/Environment                   2         4        8          

yukon / Native MPI                    0.74      0.46       *
yukon / MPICH                         1.08      0.58       *
SGI cluster / Native MPI              2.85      1.84       *
SGI cluster / MPICH                   2.77      2.38       *
Intel Solaris / MPICH                0.0018     0.0010   0.00088
Intel Linux / MPICH                   54.8        *        *
  • The T3E has 450 MHz DEC Alpha chips
  • The SGI machines have 270 MHz MIPS R12000 chips
  • The Intel Solaris machines have 450 MHz Pentium II chips
  • The Intel Linux machines have 450 MHz Pentium III chips

The above timings exhibit great variation. There are clearly a large number of differences in the computing environments that affect execution speed: some implementations use native compilers while others use g++; the various environments are optimized for different computations; and, of course, in multiuser cluster environments the load on particular machines varies at any given time. For example, on a previous "SGI cluster / Native MPI" test run, it took 7 seconds for 2 processors and 11 seconds for 4 processors. As it turned out, one of the machines had a big load on it from other users' jobs.

Conclusions / Future Work

The above work was a first exercise in exploring programming environments that might allow researchers and students at small institutions, with limited resources, to learn about and develop parallel programs in a manner that will be compatible with usage at the supercomputing centers. In this way, "inefficient" activities such as development and debugging may be performed "offline," reserving supercomputing resources for final testing and production runs.

As demonstrated, MPI clearly serves as a portable tool for learning and developing across a wide range of platforms. There are differences between the MPI implementations, but use of MPICH can often create a "common" environment, while maintaining compatibility with other software tools.

An evaluation copy of Totalview, a portable, graphical parallel debugger, was also installed on the SGI cluster (and on a Linux cluster), and will be discussed in a future newsletter.

Finally, when one begins to accommodate an increasing number of users on a cluster, it becomes necessary to worry about the scheduling of jobs. I have been investigating a free, portable scheduling system, the Portable Batch System (PBS), which is much like the NQS used on yukon. With it, users can submit jobs as they do with NQS, and the jobs will be launched on the most appropriate machines in the cluster, or held until processors are available. PBS is highly configurable. To date, I have installed and tested it on a Linux cluster, and it has been installed on ARSC's SGI cluster with minor problems related to Kerberos authentication. This, too, will be discussed in a future newsletter, but a short example appears below.
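For readers who have not seen PBS, a job script looks much like an NQS script: a shell script with scheduler directives at the top, submitted with "qsub". The following is only a hypothetical sketch; the resource names, limits, and the pJacobi command line are illustrative and vary by installation.

--------------------------------------------------------------
#!/bin/sh
#PBS -N pJacobi
#PBS -l nodes=4
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
mpirun -np 4 -machinefile $PBS_NODEFILE pJacobi 4000
--------------------------------------------------------------

The job would be submitted with "qsub jobscript", and PBS would hold it until the requested processors become available.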

Checkpointing Long Running Jobs

[ Many thanks to the Technical Assistance Group at the NASA Center for Computational Sciences (NCCS) for permission to print this article. It was taken from their T3E help pages, at:

http://esdcd.gsfc.nasa.gov/ESS/crayt3e/CrayT3E.html

NCCS is a part of the Earth and Space Data Computing Division at Goddard Space Flight Center. The Technical Assistance Group may be reached at: tag@nccs.gsfc.nasa.gov . ]

Batch jobs on the T3E are restricted by a maximum runtime. Given this constraint, how can you execute a program that takes many hours? By designing it so that it writes out intermediate data at intervals and can be restarted by reading in this data and continuing from the point at which the previous execution wrote it out. Then you can submit it to the batch queue as many times as necessary.

Since most parallel programs that need long run times are iterative, it is convenient to design a program so that it writes out intermediate data every N iterations, where N is chosen so that it will be reached with certainty before the batch queue time limit expires.

Suppose your program iterates until some convergence criterion is met:


        Initialize
        Read input data and parameters
        WHILE NOT reached convergence
                Calculation
        END-WHILE
        Write results
The central loop might be modified to do N iterations and exit:

        J = 1
        WHILE NOT reached convergence AND J < N
                Calculation
                J = J+1
        END-WHILE
        IF J = N
                Write intermediate data
        ELSE
                Write results
        END-IF
Or the central loop could be modified to continue running after saving intermediate data by resetting a counter:

        J = 0
        WHILE NOT reached convergence
                Calculation
                J = J+1
                IF J = N
                        Write intermediate data
                        J = 0
                END-IF
        END-WHILE
With this latter method, the iterations that occur from N+1 up until the job uses up its run time will be wasted, because on restart it will be restored to where it was after iteration N. On the other hand, this design is useful if the program is ever to be run where computing time is unlimited, subject only to random events such as hardware failure or power outage.

Either way, the original beginning of the program:


        Initialize
        Read input data and parameters
must be modified to read the intermediate data file conditionally. This could be accomplished by:

        Initialize
        IF intermediate data file exists
                Read intermediate data file
        ELSE
                Read input data and parameters
Obviously, "intermediate data" in this context will consist not only of arrays but also of whatever individual variables are needed to restore the program to the environment it had after the Nth iteration.

Quick-Tip Q & A



A:{{ How can I initialize common blocks in Fortran 90?

Thanks to Alan Wallcraft for this reply:


  F90 allows initialization as part of a type declaration statement, 
  as an alternative to a DATA statement, but the rules for common block 
  initialization are the same in F90 as in F77.  Blank common cannot 
  be initialized at all, and named common can only be initialized via a 
  BLOCK DATA program unit.  Many compilers allow common to be initialized
  outside BLOCK DATA, but that is a non-portable extension to the language.
  
  A better approach in F90 is to use a MODULE in place of COMMON.
  
  MODULE COMMON_A
      SAVE
      INTEGER :: KA   = 1
      REAL    :: X    = 9.0
      REAL    :: Y(3) = (/ 1.0, 8.0, 99.0 /)
  END MODULE
  
  This is approximately equivalent to:
  
  BLOCK DATA
      INTEGER   KA
      REAL      X,Y
      COMMON/A/ KA,X,Y(3)
      SAVE  /A/
      DATA KA, X, Y / 1, 9.0, 1.0, 8.0, 99.0 /
  END
  
  On the T3E, saved module variables are symmetric data objects (accessible
  via SHMEM) just like common block variables.
  
  However, a module is not "sequence associated", i.e. it isn't necessary
  for the module's KA to be immediately before X in memory.  Usually
  this is an advantage, since the fixed memory order of a common can lead
  to poor memory access performance on some machines, and this should be
  less likely to happen with a module (and a good compiler).
  
  Any subroutine that accesses variables in the module must start with a 
  USE statement.
  
  SUBROUTINE KA_8
      USE COMMON_A
      KA = 8
  END
  
  As an alternative to:
  
  SUBROUTINE KA_8
      INTEGER   KA
      REAL      X,Y
      COMMON/A/ KA,X,Y(3)
      SAVE  /A/
      KA = 8
  END





Q: What does the "R" mean?

     Rr--------   1 freddy  mygroup    2772992 Jun 28 07:30 core

   And, while we're at it, I have no clue what program I crashed to
   create this core file. How can I find out?

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.