ARSC HPC Users' Newsletter 206, October 13, 2000

ARSC Linux Cluster; Visitor; and Events October 18th

ARSC is honored to have as a working guest all next week Patricia Kovatch, Manager of the HPC Systems Group at the Albuquerque High Performance Computing Center (AHPCC), University of New Mexico. Patricia will be helping ARSC install a cluster that will be used in various classes and training events by ARSC and UAF departments.

Two events are scheduled that will be of interest to anyone running, building, or considering a cluster.

  • Discuss your cluster plans with an expert.

Wednesday 18th October, 12:00, Room 109, Butrovich Bldg.

This is a bring-your-own, "brown bag" lunch event. Please contact Guy Robinson ( robinson@arsc.edu ) if you plan to attend.

  • Patricia Kovatch will give a talk entitled "Linux Superclusters: Ready for Production?"

Wednesday 18th October, 2:00pm, Room 109, Butrovich Bldg.

ABSTRACT:

The National Computational Science Alliance (NCSA) has created several production superclusters for scientists and researchers to run a variety of parallel applications. Superclusters are large scale clusters built from commodity parts and high performance interconnects. The goal of these clusters is to provide easy-to-use high performance computing systems at reasonable prices. These clusters are linked with other high performance computing resources to a national computational grid through NCSA's Virtual Machine Room (VMR).

This seminar will discuss the details of the design, implementation, management, and performance of these systems. Examples of cluster configurations and experiences will be given.

CUG SV1 Workshop Oct 23-25

The CUG SV1 Workshop is coming up fast.

Where and When:

The CUG Fall 2000 Workshop will be held in Minneapolis, Minnesota October 23-25, 2000. There will be a workshop dinner on October 24th. Sponsors for the Conference are CUG and Cray Inc.

Technical Program:

The meeting will be one track for two and one-half days, starting at 8:30 on Monday, October 23rd. The preliminary program is available at:

http://www.fpes.com/cugf00/Pages/prelmprg.htm

Registration and other details:

http://www.fpes.com/cugf00/

Cartesian Topologies in MPI: Part II

[ Thanks to Dr. Don Morton, University of Montana Missoula, for contributing this series of articles. This is part 2 of 2. ]

In part 1 of this series (see: /arsc/support/news/hpcnews/hpcnews205/index.xml ), Cartesian topologies and the four basic MPI support functions were introduced. These MPI functions allow you to view the processors logically as a mesh configuration, regardless of the underlying network or the physical processor layout.
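
As a quick refresher, here is a minimal, self-contained sketch of those topology calls for a 2 x 2 mesh with no wraparound. It is assumed to be run on exactly four processes and is not part of the heat diffusion example below:

//==================  topology refresher (sketch)  =================

#include <stdio.h>
#include "mpi.h"

int main(int argc, char * argv[])  {
    int myRank, coords[2];
    int dims[2]    = {2, 2};     // 2 rows x 2 columns of processes
    int periods[2] = {0, 0};     // no wraparound in either direction
    int up, down, left, right;
    MPI_Comm comm2d;

    MPI_Init(&argc, &argv);

    // Build a 2D cartesian communicator (reordering allowed) ...
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm2d);

    // ... ask for my rank and (row, column) coordinates within it ...
    MPI_Comm_rank(comm2d, &myRank);
    MPI_Cart_coords(comm2d, myRank, 2, coords);

    // ... and the ranks of my four neighbours (MPI_PROC_NULL at an edge).
    MPI_Cart_shift(comm2d, 0, 1, &up,   &down);
    MPI_Cart_shift(comm2d, 1, 1, &left, &right);

    printf("PE%d at (%d,%d): up=%d down=%d left=%d right=%d\n",
           myRank, coords[0], coords[1], up, down, left, right);

    MPI_Finalize();
    return 0;
}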

The following sample program and output should help demonstrate the power of this approach. A lot of information is included as comments in the C++ code.

The program is kept as simple as possible, with many simplifying assumptions. It solves the time-dependent heat diffusion equation using finite difference methods. Simulation parameters are specified in the first section of constants. Users may specify an initial value (e.g., temperature) for all grid points, and a boundary condition to be applied at all boundary points. Users may also specify the distance between grid points (the same in the vertical and horizontal directions) along with time-stepping parameters. The constants M and N specify the number of grid points IN EACH PROCESSOR in the vertical and horizontal directions, respectively.
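
For reference, the explicit finite difference update that the code applies to every interior grid point at each timestep (it appears verbatim in the "Calculate new values" section of the program) is:

    UNew[i][j] = U[i][j] + (deltaT/(h*h)) * ( U[i-1][j] + U[i+1][j]
                                            + U[i][j-1] + U[i][j+1]
                                            - 4*U[i][j] )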

The following diagram illustrates the layout of the problem domain for a program with constants as specified in the example, run as follows (in a cluster environment):


    mpirun -np 4 cart 2 2

"cart" is the name of the executable, and the "2 2" command-line arguments specify the number of processors in the vertical and horizontal directions, respectively. The diagram below shows four processors, each possessing four (2 x 2, or M x N) grid points. It also illustrates the fact that variables "i" and "M" relate to the vertical dimension, and variables "j" and "N" relate to the horizontal dimension. Other aspects of the diagram will be referred to in the comments of the program.



                                 j, N  --->
           -------------------------------------------------
          |                        |                        |
          |   u(1,1)    u(1,2)     |   u(1,1)    u(1,2)     |
          |                        |                        |
          |        PE(0,0)         |        PE(0,1)         |
          |                        |                        |
          |   u(2,1)    u(2,2)     |   u(2,1)    u(2,2)     |
          |                        |                        |
   i, M   |------------------------+------------------------|
    |     |                        |                        |
    |     |   u(1,1)    u(1,2)     |   u(1,1)    u(1,2)     |
    v     |                        |                        |
          |        PE(1,0)         |        PE(1,1)         |
          |                        |                        |
          |   u(2,1)    u(2,2)     |   u(2,1)    u(2,2)     |
          |                        |                        |
           -------------------------------------------------

Diagram 1:  Layout of 2X2 processor mesh, with 2X2 grid points in
              each processor.

Finally, here is the program:


//======================  cart.C  ==================================

#include <unistd.h>
#include <stdlib.h>
#include <math.h>
#include <iostream.h>
#include "mpi.h"

int main(int argc, char * argv[])  {

// In the following, "M" will always refer to vertical  
// direction (jumping from one row to next) and "N" 
// will refer to horizontal direction (jumping from one 
// column to next)
// Likewise, "i" will always be used to index in the vertical
// (M) direction, and "j" will always index in the horizontal
// (N) direction.

// ===========================Some constants ============================
int const M = 2, N = 2;              // # of vert and horiz. interior grid  
                                     // pts. in each processor
double const boundaryValue = 100.0;  // Constant boundary condition
double const initValue = 0.0;        // Initial value of interior points
double const h = 1.0;                // Distance between grid points

// Time-stepping parameters - start, stop, timestep size
double const T0 = 0.0;
double const Tn = 0.1;
double const deltaT = 0.1;

int const numDim = 2;                  // Number of dimensions
// ======================================================================

int MyPE,               // My logical process number
    NumPE,              // Number of processes
    info,               // Return value for error checking
    MP, NP,             // Dimensions of MPxNP processor mesh
    MyLeftNeighbour,    // Logical PE numbers of neighbour PE's
    MyRightNeighbour,
    MyTopNeighbour,
    MyBottomNeighbour,
    i, j;               // loop index variables 

// In the following, U refers to "current" values at each
// grid point, and UNew refers to the new solution for a
// given timestep.   In diagram 1, the "u" variables 
// refer to these unknowns at interior grid points.
double U[M+2][N+2];     // Local grid on each processor
                        // - note that indices will run from
                        //   0 to M+1 (or N+1).  We assume
                        //   that a processor's grid points are
                        //   stored with indices ranging from 1
                        //   to M (or N), and the 0th and Mth
                        //   (or Nth) indices store "buffer" values
                        //   from neighbour processors' boundaries
double UNew[M+2][N+2];  // Maintaining same indexing as U[][] so
                        // that transfers from one to another are
                        // simpler

// Communication buffer for exchanging processor edge values
double buffer[M+N];         // I "wanted" to do something like
                        // double buffer[max(M,N)] but this
                         // didn't seem to work.  The bottom line
                        // is that buffer needs to be large enough
                        // to hold a full row or a full column.
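                        // (Note: in C++ an array bound must be a
                        //  compile-time constant expression, which is
                        //  why something like max(M,N) is awkward here;
                        //  M+N is simply a safe upper bound on both.)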

double t;               // Time index variable

// Variables for defining and utilising cartesian topology
int dims[2];            // Dimensions of 2D topology
int periods[2];         // Flags (logical) for defining wraparound
int reorder;            // Logical flag for letting MPI reorder processors
int myCoords[2];        // My coordinates in the processor mesh
MPI_Comm COMM2D;        // Communicator for cartesian topology

MPI_Status status;      // For MPI_Recv()

double wt1, wt2;        // Timers

char myhost[80];        // Stores hostname of machine this process 
                        // is running on

#ifdef DEBUG
cout << "Before MPI_Init()" << endl;
#endif
MPI_Init(&argc, &argv);


// Make sure the executable was run with the correct number of
// arguments - should be two arguments - the number of processors
// in each of the vertical and horizontal dimensions.
if ( (argc < 3) || (argc > 3) ) {
     cerr << "Usage: cart1 MP NP" << endl << endl;
     cerr << "       MP, NP = dimensions of processor mesh" << endl;
     MPI_Abort(MPI_COMM_WORLD, 0);
     }

#ifdef DEBUG
cout << "Before MPI_Comm_rank()" << endl;
#endif
MPI_Comm_rank(MPI_COMM_WORLD, &MyPE);
#ifdef DEBUG
cout << "Before MPI_Comm_size()" << endl;
#endif
MPI_Comm_size(MPI_COMM_WORLD, &NumPE);

MP = atoi(argv[1]); NP = atoi(argv[2]);

#ifdef DEBUG
     cout << "MP, NP = " << MP << "  " << NP << endl;
     cout << "NumPE = " << NumPE << endl;
#endif

// We fire this up by specifying the number of processors we
// want to use.
// Make sure that MP x NP = Number of PE's
if (MP*NP != NumPE) {
     cerr << "Abort: MP x NP must be equal to number of processors" << endl;
     MPI_Abort(MPI_COMM_WORLD, 0);
     }

#ifdef DEBUG
gethostname(myhost, 80);
cout << "PE" << MyPE << ": running on " << myhost << endl;
#endif


// Set up the cartesian topology (numDim defined as const above)
// Note that for the purposes of the MPI_Cart_create() call, 
// dims[0] is the vertical direction, and dims[1] is horizontal
dims[0] = MP;
dims[1] = NP;
// A value of zero indicates that there is no wraparound.  For
// example, if we have N processors in the horizontal direction
// and "periods" is set to "1", then processor 0 in a row has
// processor N-1 as its left neighbour, and processor N-1 has
// processor 0 as its right neighbour.  If periods is set to "0",
// then processor 0 has no left neighbour and processor N-1 has
// no right neighbour.
periods[0] = 0;
periods[1] = 0;

// A "1" indicates that the MPI implementation should reorder
// processor numbers for optimisation.
reorder = 1;

// We create a new communicator called COMM2D for a cartesian
// topology.  This communicator will be used in subsequent calls.
MPI_Cart_create(MPI_COMM_WORLD, numDim, dims, periods, reorder, &COMM2D);

// Find out what my coordinates are.  MPI_Cart_coords will provide
// the coordinates for a specified processor.  So, I indicate that
// I want to find out what "my" coordinates are.  The coordinates
// for processors are indicated in diagram 1.
MPI_Cart_coords(COMM2D, MyPE, numDim, myCoords);
#ifdef DEBUG
     cout << "PE" << MyPE << " - myCoords: (" << myCoords[0] 
          << "," << myCoords[1] << ")" << endl;
#endif

//-------------------------------------------------------------------
// Find out the ranks of my left, right, top, bottom neighbours.
// Note that one or more might have NULL values. 
//-------------------------------------------------------------------
// The "0" of the second argument indicates that we want to
// find adjacent rank numbers in the vertical direction.  The
// "1" of the third argument indicates that we want to find 
// the ranks of the process "1" over.  A "2" would indicate
// that we want to find the rank of the processor "2" processors
// over.  The return arguments (last two) are the ranks of my
// "adjacent" processors.  Now I know who to send my "edge" values
// to, and who to receive "edge" values from. 
MPI_Cart_shift(COMM2D, 0, 1, &MyTopNeighbour, &MyBottomNeighbour);

// The "1" of the second argument indicates that we want to
// find adjacent rank numbers in the horizontal direction.
MPI_Cart_shift(COMM2D, 1, 1, &MyLeftNeighbour, &MyRightNeighbour);

#ifdef DEBUG
     cout << "PE" << MyPE << " - TNeighb " << MyTopNeighbour 
                          << ", - BNeighb " << MyBottomNeighbour 
                          << endl;
     cout << "PE" << MyPE << " - LNeighb " << MyLeftNeighbour 
                          << ", - RNeighb " << MyRightNeighbour 
                          << endl;
#endif

//-------------------------------------------------------------------
// Initialise grid in each processor - use LOCAL numbering in each
// processor - no global
//-------------------------------------------------------------------

// First, the interior, initial values
for (i=1; i<=M; i++)
     for (j=1; j<=N; j++)
          U[i][j] = initValue;

// Next, the boundary values - in each processor, we check
// on each side for presence of a neighbour.  If there is
// no neighbour, we fill in the appropriate buffer cells
// with the boundary condition.  Note that we could easily
// modify this to account for different boundary values
// on different edges (for more interesting problems).

// Left
if (MyLeftNeighbour == MPI_PROC_NULL)
   for (i=1; i<=M; i++)
      U[i][0] = boundaryValue;

// Right
if (MyRightNeighbour == MPI_PROC_NULL)
   for (i=1; i<=M; i++)
      U[i][N+1] = boundaryValue;

// Top
if (MyTopNeighbour == MPI_PROC_NULL)
   for (j=1; j<=N; j++)
      U[0][j] = boundaryValue;

// Bottom
if (MyBottomNeighbour == MPI_PROC_NULL)
   for (j=1; j<=N; j++)
      U[M+1][j] = boundaryValue;

// Start the timer
wt1 = MPI_Wtime();

#ifdef DEBUG
     cout << "PE" << MyPE << ": Beginning to compute....." << endl;
#endif


//-------------------------------------------------------------------
// Begin timestepping
//-------------------------------------------------------------------
t = T0;
while (t < Tn) {



     //-------------------------------------------------------------------
     // Exchange interprocessor boundary values
     // If there is a NULL neighbour, that implies 
     // an edge of the global problem domain - buffer
     // cells for boundary conditions were filled 
     // before timestepping.  To help ensure correct
     // synchronisation, message tag will always be the
     // sending PE.
     //-------------------------------------------------------------------

     // Exchange with my left neighbour  
     // If there is a processor to our left, we fill a buffer with
     // with our leftmost column of values.  This will get sent
     // to the left neighbour and received and stored in their
     // "buffer" cells.
     if (MyLeftNeighbour != MPI_PROC_NULL) {
          for (i=1; i<=M; i++)
               buffer[i-1] = U[i][1];
          MPI_Send(buffer, M, MPI_DOUBLE, MyLeftNeighbour, MyPE,
                   COMM2D);
#ifdef DEBUG
          cout << "PE" << MyPE << ": Sent to " << MyLeftNeighbour  << endl;
#endif

          // Likewise, if I have a left neighbour, I want to
          // receive its rightmost values and store them in
          // my left column of buffer cells.
          MPI_Recv(buffer, M, MPI_DOUBLE, MyLeftNeighbour, 
                   MyLeftNeighbour, COMM2D, &status);
          for (i=1; i<=M; i++)
               U[i][0] = buffer[i-1];

#ifdef DEBUG
          cout << "PE" << MyPE << ": Recv from " << MyLeftNeighbour  << endl;
#endif
          } // END exchange with left neighbour

     // Exchange with my right neighbour  
     if (MyRightNeighbour != MPI_PROC_NULL) {
          for (i=1; i<=M; i++)
               buffer[i-1] = U[i][N];   // rightmost interior column is column N
          MPI_Send(buffer, M, MPI_DOUBLE, MyRightNeighbour, MyPE,
                   COMM2D);

          MPI_Recv(buffer, M, MPI_DOUBLE, MyRightNeighbour, 
                   MyRightNeighbour, COMM2D, &status);
          for (i=1; i<=M; i++)
               U[i][N+1] = buffer[i-1]; // right-hand buffer column is column N+1
#ifdef DEBUG
          cout << "PE" << MyPE << ": Recv from " << MyRightNeighbour << endl;
#endif
          } // END exchange with right neighbour

     // Exchange with my top neighbour  
     if (MyTopNeighbour != MPI_PROC_NULL) {
          for (i=1; i<=N; i++)
               buffer[i-1] = U[1][i];
          MPI_Send(buffer, N, MPI_DOUBLE, MyTopNeighbour, MyPE,
                   COMM2D);

          MPI_Recv(buffer, N, MPI_DOUBLE, MyTopNeighbour, 
                   MyTopNeighbour, COMM2D, &status);
          for (i=1; i<=N; i++)
               U[0][i] = buffer[i-1];
#ifdef DEBUG
          cout << "PE" << MyPE << ": Recv from " << MyTopNeighbour << endl;
#endif
          } // END exchange with top neighbour

     // Exchange with my bottom neighbour  
     if (MyBottomNeighbour != MPI_PROC_NULL) {
          for (i=1; i<=N; i++)
               buffer[i-1] = U[M][i];
          MPI_Send(buffer, N, MPI_DOUBLE, MyBottomNeighbour, MyPE,
                   COMM2D);

          MPI_Recv(buffer, N, MPI_DOUBLE, MyBottomNeighbour, 
                   MyBottomNeighbour, COMM2D, &status);
          for (i=1; i<=N; i++)
               U[M+1][i] = buffer[i-1];
#ifdef DEBUG
          cout << "PE" << MyPE << ": Recv from " << MyBottomNeighbour << endl;
#endif
          } // END exchange with bottom neighbour


     //-------------------------------------------------------------------
     // Calculate new values
     //-------------------------------------------------------------------
     for (i=1; i<=M; i++) {
          for (j=1; j<=N; j++) {
               UNew[i][j] = (deltaT/(h*h)) * ( U[i-1][j] + U[i+1][j] +
                                               U[i][j-1] + U[i][j+1] -
                                               4*U[i][j] ) 
                                           + U[i][j];
#ifdef DEBUG2
               cout << "PE " << MyPE << ": i,j = " << i << "," << j
                    << " >>>> Ui-1,j: " << U[i-1][j]
                    << "      Ui+1,j: " << U[i+1][j]
                    << "      Ui,j-1: " << U[i][j-1]
                    << "      Ui,j+1: " << U[i][j+1]
                    << "      Ui,j: " << U[i][j]  << endl;
#endif

               }
          }  // END calculating U[i][j]



     //-------------------------------------------------------------------
     // Transfer new values to old
     //-------------------------------------------------------------------
     for (i=1; i<=M; i++)
          for (j=1; j<=N; j++)
               U[i][j] = UNew[i][j];

     t = t + deltaT;
#ifdef DEBUG
          cout << "PE" << MyPE << ": t = " << t << " ..." << endl;
#endif


     }  // END main timestepping loop


// Stop the timer
wt2 = MPI_Wtime();

if (MyPE==0) {
   cout << "Total Time = " << wt2 - wt1 << " seconds" << endl;
   }


MPI_Finalize();


return(0);

}  //  END main()

Running this code on ARSC's T3E with a total of 518,400 grid points, timestepping from t=0 to t=1000 in increments of 0.1 gives the following timings...



PE's             Grid Point Dimensions      Wall Time (seconds)
                   in each processor
1x1 =  1              720x720                       115.4   
2x2 =  4              360x360                        16.0
3x3 =  9              240x240                         6.7
4x4 = 16              180x180                         4.1
5x5 = 25              144x144                         2.7
6x6 = 36              120x120                         1.9

Running on an "old" Linux cluster with only five working nodes, the largest "reasonable" job we could run involved 14400 grid points. Timings are as follows:


PE's             Grid Point Dimensions      Wall Time (seconds)
                   in each processor
1x1 =  1              120x120                       271.6   
2x2 =  4               60x60                         95.4   

So, as with almost all MPI codes, this code is portable and will run on architectures ranging from Cray MPPs to Linux clusters. A big advantage of using the Cartesian topologies supported by MPI is that, with little effort, it is possible to write a finite difference code that is portable not only across a wide range of architectures but also across a variety of logical processor configurations, with no modification. With this scheme, applications that rely on "nearest neighbour" communications can be written once and run on a wide variety of configurations.
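
For example, on a six-processor allocation the same executable could be laid out either way without recompiling (command lines only, shown for a generic cluster environment):

    mpirun -np 6 cart 2 3     (2 processors vertically, 3 horizontally)
    mpirun -np 6 cart 3 2     (3 processors vertically, 2 horizontally)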



MAKEFILE
--------

BINARY=cart1


##############   MPICH on Solaris cluster at U. Montana      ###############
CC=mpiCC
INCLUDES=
LIB=
############################################################################

##############   Cray T3E at ARSC                            ###############
#CC=CC
#VAMPIR=/usr/local/pkg/VAMPIRtrace/current
#INCLUDES=
#LIB=-L${VAMPIR}/lib -lVT -lpmpi -lmpi
############################################################################


###########  DEFINES  ###################
#
#    DEBUG     -    Fires up lots of intermediate output statements
#
#########################################

#DEFS=-DDEBUG -DIDENTSYS
#DEFS=-DIDENTSYS
#DEFS=-DDEBUG -DPRINT -DDEBUG2
DEFS=

${BINARY}: ${BINARY}.o
        ${CC} -o ${BINARY} ${BINARY}.o ${LIB}

${BINARY}.o : ${BINARY}.C Makefile
        ${CC} -c ${BINARY}.C ${DEFS} ${INCLUDES}
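
With the MPICH section active (as shown above), a typical build-and-run sequence on a cluster would look something like the following; note that the makefile names the executable "cart1", while the article text calls it "cart":

    make
    mpirun -np 4 cart1 2 2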

Quick-Tip Q & A


A:[[ My NQS script is written in "sh" (the first line is, "#!/bin/sh").  
  [[ I'd like to switch programming environments from within this script,
  [[ but it claims it can't find the "module" command:
  [[
  [[  CHILKOOT$  cat job.e71167
  [[  sh-56 /usr/spool/nqe/spool/scripts/++GMz+++++0+++[2]: module: not 
  [[    found.
  [[
  [[ I already export my environment to the script (#QSUB -x), and 
  [[ in my interactive environment, module gives me no trouble.  Any
  [[ suggestions?


  Three solutions:

  1) remove "#!/bin/sh", assuming it's not serving any purpose and your
     script can run under your environment shell (exported via "#QSUB -x").

  2) replace "#!/bin/sh" with "#QSUB -s /bin/sh", if you want your script
     to run under /bin/sh.

  3) leave "#!/bin/sh", but initialize modules for "sh" by inserting 
     this line into the script, above the "module" command:
       .  /opt/modules/modules/init/sh
     (the "." says to execute the command).

  Note: Specifying "#!/bin/sh" as the first line of a qsub script is
  valid, but it probably doesn't do what you expect.  Unless you
  understand what "man qsub" says under "-s shell_name", you should
  probably stick with "#QSUB -s /bin/sh".
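
  As a concrete illustration of solution 3, the top of such a script
  might look like the following sketch ("-x" exports the interactive
  environment; the queue name "mpp" and the final "module" command are
  placeholders only):

      #!/bin/sh
      #QSUB -x
      #QSUB -q mpp

      # Initialize modules for "sh" before the first "module" command
      # (solution 3 above):
      .  /opt/modules/modules/init/sh

      module list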




Q: My Fortran 90 program calls several subroutines and functions. I
   wanted to inline them, so I recompiled everything with "-Oinline3".
   Performance was NOT improved but the friendly compiler told me this:
   Performance was NOT improved but the friendly compiler told me this:

      cf90-1548 f90: WARNING in command line
        -O inline3 is no longer the most aggressive form of inlining.  

   and "explain cf90-1548" told me this:

      The most aggressive form of inlining is now obtained thru
      -Oinline4.  - Oinline3 is a new form of inlining.  -Oinline3
      invokes leaf routine inlining.  A leaf routine is a routine which
      calls no other routines.  With -Oinline3 only leaf routines are
      expanded inline in the program.

   I wish folks would spell "through" correctly, but... back to
   my story...

   I was, of course, very excited to try "-Oinline4", and recompiled
   everything again.  Disappointment!  Performance was NOT improved.

   Next, I recompiled for flow tracing, "f90 -ef".  "flowview" showed
   the most frequently called subroutine.  It was practical, so I
   inlined it MANUALLY and, lo and behold, performance improved
   significantly.


   What am I missing, here?

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.