ARSC T3E Users' Newsletter 117, April 8, 1997

Yukon Accepted

There have been some late nights around here, and they were well spent. Yukon has been accepted, with over 96% uptime during its acceptance period, and is now running in pre-production mode.

Yukon User Testing Continues

We have a handful of non-staff users on yukon. Here's a message from Don Morton of Cameron University (he didn't intend this for the newsletter -- which, to me, makes it more interesting -- but I asked and he said okay).


> Subject: Preliminary T3E times, comments, etc.
> 
> What follows are some T3D vs. T3E times on an MPI hydrologic code.  I'll
> also add in T3D CRAFT times for the code.  All times in seconds.
> 
> PE's    Initialization Time      Single Timestep      Single Timestep
>          T3D       T3E            T3D    T3E            T3D-CRAFT
> 
>  2       26.9      3.64           6.18    2.16           6.23
>  4       26.9      3.76           4.80    1.74           4.22
>  8       26.8      3.64           4.58    1.85           3.31
> 16       26.9      3.64           5.62    2.47           2.86
> 
> The "Initialization Time" is time needed to read input files,
> set up data structures, etc. I know the T3E was going to have
> much better I/O performance, and this seems to show it... Nice
> to be rid of the Y-MP/T3D bottleneck!
> 
> Looks like I still have a communications bottleneck which I'll
> probably try to resolve with Shmem.  The bottleneck is a small
> loop which, many times within a given timestep, exchanges a small
> number of values (approx. 5-15).  Seems that the MPI latency is
> pretty high, and CRAFT seems to get around some of this when I use
> shared arrays.
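
For readers contemplating a similar MPI-to-shmem conversion of a small, frequent exchange, here is a minimal sketch, under stated assumptions: the buffer sizes and names are invented, not taken from Don's code, and it uses the same last-element-as-flag trick as the bandwidth program in the next article (with the same caveat about adaptive routing).

/* Illustrative sketch only -- names and sizes are invented, not taken
 * from Don's code.  A small, frequent bidirectional exchange done with
 * shmem_put() instead of MPI send/receive, using the last element of
 * the buffer as an arrival flag.
 * Assumes shmem_set_cache_inv() has been called at startup (T3D).
 */
#include <mpp/shmem.h>

#define NVALS 16                  /* small message: roughly 5-15 values */

static long outbuf[NVALS + 1];    /* local values, plus a flag at the end  */
static long inbuf [NVALS + 1];    /* put target must be symmetric (static) */

void exchange (int nvals, int partner)
{
  inbuf[nvals] = 0;                            /* clear my arrival flag        */
  barrier ();                                  /* partner's flag is clear too  */

  outbuf[nvals] = 1;                           /* flag travels with the data   */
  shmem_put (inbuf, outbuf, nvals+1, partner); /* one-sided transfer           */

  shmem_wait (&inbuf[nvals], 0);               /* spin until the flag arrives  */
}

In a production code one would avoid the per-exchange barrier (for example, by using the timestep number as the flag value instead of clearing it each time), but the barrier keeps the sketch simple.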

Shmem Bandwidth

In newsletter #100, we presented a simple shmem program which measures bandwidth and used it on the T3D to compare EPCC's 2-sided MPI routines with shmem. In this newsletter, we use essentially the same program to look at bandwidth on ARSC's T3E. It gives a T3D maximum of ~126 MB/s and a T3E maximum which, in our tests, ranged from about 250 MB/s to ~333 MB/s. This variability on the T3E is discussed below.

First, here's how the program works. It runs on two PEs, a sender and a receiver. The two processes synchronize at a barrier. The sender then initiates an asynchronous send using shmem_put(), while the receiver waits at a shmem_wait() for the last item in the buffer, which serves as a signal that the entire buffer has arrived. These two calls are timed, the results are printed, and the processes loop back and repeat the sequence with a larger buffer.

We hope that a "red flag" went up when you read the statement: "the receiver waits at a shmem_wait() for the last item in the buffer."

On the T3E, if adaptive routing is enabled, this is not a reliable way to test for the arrival of data. However, we have not yet enabled adaptive routing on yukon, and, just to be sure, we ran a version of the program which examines every buffer element to confirm that the entire buffer was received. On the T3D, buffer packets are guaranteed to arrive in order, so testing the last element does test the entire buffer. We will discuss adaptive routing in greater detail in future newsletters.
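
The extra check is nothing fancy: after shmem_wait() returns, the receiver simply sweeps the whole buffer. A sketch of such a check (the version we actually ran may differ; it assumes, for illustration, that the sender fills its buffer with the pattern buf[i] = i + 1 rather than the zeros used in the program listed below) might look like this:

/* Sketch of a full-buffer check.  Assumes the sender filled its buffer
 * with a known nonzero pattern, e.g. buf[i] = i + 1, so that a missing
 * element is detectable.  Called by the receiver after shmem_wait()
 * returns; a zero result means the buffer had not fully arrived.
 */
static int buffer_complete (long *buf, long nwords)
{
  long i;

  for (i = 0; i < nwords; i++)
    if (buf[i] != i + 1)          /* element not (yet) delivered */
      return 0;

  return 1;
}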

The following output is from a run on ARSC's T3D. nwords gives the size in words of the buffer exchanged; ticks is the number of clock ticks required for the transfer to complete; and usecs is the number of microseconds required. The mbr field is bandwidth in megabytes per second. Runs on the T3D always give bandwidths within a couple percent of the values shown in this run, and are not influenced by neighboring jobs.
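
As a quick check on the mbr arithmetic (it is exactly what the MBR macro in the program below computes), take the largest T3D transfer:

  1,000,000 words x 8 bytes/word  =  8,000,000 bytes
  63191.877 usecs                 =  0.063191877 seconds
  8,000,000 bytes / 0.063191877 s =  ~126.6e6 bytes/s, or ~126.6 MB/s

which agrees with the mbr=126.598549 reported on the last RECEIVER line below.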


T3D Run
====================
RECEIVER: nwords=1 ticks=1272 usecs=8.479152; mbr=0.943491
RECEIVER: nwords=10 ticks=672 usecs=4.479552; mbr=17.858930
RECEIVER: nwords=100 ticks=1533 usecs=10.218978; mbr=78.285718
RECEIVER: nwords=1000 ticks=10109 usecs=67.386591; mbr=118.717980
RECEIVER: nwords=10000 ticks=95708 usecs=637.989500; mbr=125.393913
RECEIVER: nwords=100000 ticks=957078 usecs=6379.881672; mbr=125.394175
RECEIVER: nwords=1000000 ticks=9479730 usecs=63191.877442; mbr=126.598549
SENDER: nwords=1 ticks=1010 usecs=6.732660
SENDER: nwords=10 ticks=725 usecs=4.832850
SENDER: nwords=100 ticks=1619 usecs=10.792254
SENDER: nwords=1000 ticks=10151 usecs=67.666563
SENDER: nwords=10000 ticks=95819 usecs=638.729426
SENDER: nwords=100000 ticks=956952 usecs=6379.041756
SENDER: nwords=1000000 ticks=9479679 usecs=63191.537476

The next batch of results is from ARSC T3E runs. As they show, the bandwidth the program achieves on the T3E is both higher and far more variable than what it achieves on the T3D, and it is influenced by other jobs.

This contrast between the T3D and T3E variability makes sense considering that this program runs on exactly two PEs and that in the T3D architecture, PEs are paired up, two per node, with a dedicated route between them. Communication between the sender and receiver pair on the T3D is unaffected by traffic on the torus, even when some of that traffic passes through the node on which they are running. On the T3E, however, every PE is its own node, and traffic between any given pair of nodes shares the torus network with traffic between nodes of other applications.

For the T3E runs, I used mppview -s all to capture the configuration at the time of each of the three runs (the job appears in these displays as shmem_wait). There was another user's 64-node job running throughout my runs. The best bandwidth was achieved in the third run, below, after I inserted a 16-node spacer job which separated the 2-node job from the 64-node job, giving it a relatively quiet route through the torus. (The spacer program, route, does introduce some traffic on the torus.)
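
A spacer need not do anything useful. Here is a minimal sketch of such a placeholder (our actual route program may differ): a job whose only purpose is to occupy a block of application PEs so that the next job is allocated beyond them.

/* Minimal "spacer" sketch -- our actual "route" program may differ.
 * Each PE simply sleeps, holding its place in the torus.
 */
#include <unistd.h>

main ()
{
  sleep (600);          /* hold the PEs for ten minutes, doing nothing */
  return (0);
}

Started on, say, 16 PEs just before the 2-PE benchmark, it parks those PEs for the duration of the test. A pure do-nothing spacer like this would add essentially no torus traffic of its own; as noted above, route does generate some.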


T3E Run 1:
====================
    UID  PPID     APID     Run Time  PEs  Base  Command
   ----- ----- ---------- -------- ----- ----- ----------
     162  7838 0x0186dc13 03:29:28    64     0 a.out.64
    1235 11631 0x01459c14 00:00:20     2    64 shmem_wait
====================
RECEIVER: nwords=1 ticks=3210 usecs=10.700000; mbr=0.747664
RECEIVER: nwords=10 ticks=1966 usecs=6.553333; mbr=12.207528
RECEIVER: nwords=100 ticks=1834 usecs=6.113333; mbr=130.861505
RECEIVER: nwords=1000 ticks=10640 usecs=35.466667; mbr=225.563910
RECEIVER: nwords=10000 ticks=92944 usecs=309.813333; mbr=258.220003
RECEIVER: nwords=100000 ticks=950161 usecs=3167.203333; mbr=252.588772
RECEIVER: nwords=1000000 ticks=9421812 usecs=31406.040000; mbr=254.728071
SENDER: nwords=1 ticks=2900 usecs=9.666667
SENDER: nwords=10 ticks=1614 usecs=5.380000
SENDER: nwords=100 ticks=950 usecs=3.166667
SENDER: nwords=1000 ticks=9542 usecs=31.806667
SENDER: nwords=10000 ticks=92649 usecs=308.830000
SENDER: nwords=100000 ticks=949593 usecs=3165.310000
SENDER: nwords=1000000 ticks=9421195 usecs=31403.983333


T3E Run 2:
====================
    UID  PPID     APID     Run Time  PEs  Base  Command
   ----- ----- ---------- -------- ----- ----- ----------
     162  7838 0x0186dc13 05:20:44    64     0 a.out.64
    1235 12203 0x01459c11 00:00:09     8    64 route     <--8 node "spacer"
    1235 12212 0x0145bc12 00:00:04     2    72 shmem_wait
====================
RECEIVER: nwords=1 ticks=3358 usecs=11.193333; mbr=0.714711
RECEIVER: nwords=10 ticks=2186 usecs=7.286667; mbr=10.978957
RECEIVER: nwords=100 ticks=2054 usecs=6.846667; mbr=116.845180
RECEIVER: nwords=1000 ticks=10552 usecs=35.173333; mbr=227.445034
RECEIVER: nwords=10000 ticks=89032 usecs=296.773333; mbr=269.565999
RECEIVER: nwords=100000 ticks=932565 usecs=3108.550000; mbr=257.354715
RECEIVER: nwords=1000000 ticks=9403328 usecs=31344.426667; mbr=255.228787
SENDER: nwords=1 ticks=3008 usecs=10.026667
SENDER: nwords=10 ticks=1746 usecs=5.820000
SENDER: nwords=100 ticks=817 usecs=2.723333
SENDER: nwords=1000 ticks=9510 usecs=31.700000
SENDER: nwords=10000 ticks=88518 usecs=295.060000
SENDER: nwords=100000 ticks=931773 usecs=3105.910000
SENDER: nwords=1000000 ticks=9402727 usecs=31342.423333


T3E Run 3:  (Best Performance)
===============================
    UID  PPID     APID     Run Time  PEs  Base  Command
   ----- ----- ---------- -------- ----- ----- ----------
     162  7838 0x0186dc13 05:26:43    64     0 a.out.64
    1235 12296 0x0145bc13 00:00:15    16    64 route    <--16 node "spacer"
    1235 12313 0x0145f414 00:00:08     2    80 shmem_wait
====================
RECEIVER: nwords=1 ticks=3226 usecs=10.753333; mbr=0.743955
RECEIVER: nwords=10 ticks=1958 usecs=6.526667; mbr=12.257406
RECEIVER: nwords=100 ticks=1562 usecs=5.206667; mbr=153.649168
RECEIVER: nwords=1000 ticks=8008 usecs=26.693333; mbr=299.700300
RECEIVER: nwords=10000 ticks=72100 usecs=240.333333; mbr=332.871012
RECEIVER: nwords=100000 ticks=729933 usecs=2433.110000; mbr=328.797301
RECEIVER: nwords=1000000 ticks=7262705 usecs=24209.016667; mbr=330.455388
SENDER: nwords=1 ticks=2928 usecs=9.760000
SENDER: nwords=10 ticks=1778 usecs=5.926667
SENDER: nwords=100 ticks=937 usecs=3.123333
SENDER: nwords=1000 ticks=7585 usecs=25.283333
SENDER: nwords=10000 ticks=71993 usecs=239.976667
SENDER: nwords=100000 ticks=729573 usecs=2431.910000
SENDER: nwords=1000000 ticks=7262383 usecs=24207.943333


====================

Here's the code. As noted in Newsletter #116, the cache coherence call becomes a no-op on the T3E, but should remain for T3D portability. The only change needed in porting this code to the T3E was in the #include preprocessor directives (for details, see the next article).


/*-------------------------------------------------------------*/
#include <mpp/shmem.h>

#ifdef T3D
#include <mpp/stdio.h>
#include <mpp/limits.h>
#include <mpp/time.h>
#endif

#ifdef T3E
#include <stdio.h>
#include <limits.h>
#include <time.h>
#endif

#define BUFSZ 1000000

/* d = delta clock ticks; n = number of words */
#define USECS(d)  ((float)(d)*1000000.0/(float)CLK_TCK)
#define MBR(d,n)  (((float)(n)*8.0/1000000.0)/(USECS(d)/1000000.0))


main () {
  int         n;
  long        nwords;
  int         mype, otherpe, npes; 
  long        t1,t2;
  long        buf[BUFSZ];
  fortran     irtc();              /* Cray real-time clock intrinsic */

  npes = shmem_n_pes();
  if (npes < 2) {
    printf ("ERROR: Minimum of 2 PEs required.\n");
    exit (1);
  }  

  for (n = 0; n < BUFSZ; n++) 
    buf[n] = 0; 
        
  mype = shmem_my_pe();
  shmem_set_cache_inv();             /* Reload cache when put received */

  switch (mype) {
    case 0:
      otherpe = 1;

      for (nwords = 1; nwords <= BUFSZ; nwords*=10) {
        buf[nwords-1] = 1;    /* Use as flag to shmem_wait() on receiver */

        barrier(); 

        t1 = irtc ();
        shmem_put (buf, buf, nwords, otherpe);
        t2 = irtc ();

        printf ("SENDER: nwords=%ld ticks=%ld usecs=%f\n", 
          nwords, (t2-t1), USECS(t2-t1));
      }
      
      break;


    case 1:
      for (nwords = 1; nwords <= BUFSZ; nwords*=10) {
        barrier(); 

        t1 = irtc ();
        shmem_wait( &buf[nwords-1], 0 );    /* wait for last element */
        t2 = irtc ();

        printf ("RECEIVER: nwords=%ld ticks=%ld usecs=%f; mbr=%f\n", 
          nwords, t2-t1,  USECS(t2-t1), MBR(t2-t1,nwords));
      }
      
      break;


    default:
      break;
  }
} /*-------------------------------------------------------------*/

Header Files

Some header files are in different locations on the T3E and T3D. It can be a little confusing, as this example (taken from the previous article) shows:


  #include <mpp/shmem.h>
  
  #ifdef T3D
  #include <mpp/stdio.h>
  #include <mpp/limits.h>
  #include <mpp/time.h>
  #endif
  
  #ifdef T3E
  #include <stdio.h>
  #include <limits.h>
  #include <time.h>
  #endif

One might ask: why were stdio.h, limits.h, and time.h moved out of the mpp directory, while shmem.h remained?

This question suggests, erroneously, that on the Y-MP, these .h files actually are in the same directory, mpp . It turns out that they're not: stdio.h, limits.h, time.h (and about 135 other .h files) are system include files and are found under the /usr/include/ tree, while shmem.h is part of the programming environment, and is under the /opt/ctl/ tree. For example:


-------------------- limits.h

  On Y-MP:
    denali$ find /usr/include -name limits.h -print
    /usr/include/mpp/limits.h
    /usr/include/limits.h
  
  On T3E:
    yukon$ find /usr/include -name limits.h -print
    /usr/include/limits.h
  
  The Y-MP has two "limits.h" files because it is a single host shared
  by two architectures.  Life is simpler on the T3E: the "mpp"
  subdirectory is completely gone, and the contents of the Y-MP's
  /usr/include/mpp/ directory have been moved back down into
  /usr/include/ (with the necessary exception or two -- see mpi.h,
  below).

-------------------- shmem.h

On Y-MP:
  denali$  find /opt -name shmem.h -print
  /opt/ctl/craylibs_m/2.0.0.0/include/mpp/mpp/shmem.h
  /opt/ctl/craylibs_m/2.0.2.0/include/mpp/mpp/shmem.h

On T3E:
  yukon$ find /opt -name shmem\*.h -print
  /opt/ctl/craylibs/2.0.3.0/include/mpp/shmem.h
  /opt/ctl/craylibs/2.0.3.3/include/mpp/shmem.h


  The shared memory archive and headers are not considered "system"
  files, but, rather, part of the programming environment ("product"
  files). It looks like they've moved (again, an "mpp" subdirectory has
  fallen out), but this is a transparent change:  PE 2.0's environment
  sets up the paths to these headers for us.

-------------------- mpi.h

On Y-MP:
  denali$ find /usr/include -name mpi\*.h -print 
  /usr/include/mpp/mpi.h
  /usr/include/mpp/mpif.h

On T3E:  
  yukon$  find /usr/include -name mpi\*.h -print 
  yukon$  find /opt/ctl -name mpi\*.h -print 
  /opt/ctl/mpt/1.1.0.0/include/mpi.h
  /opt/ctl/mpt/1.1.0.0/include/mpif.h 
  
  On the Y-MP, mpi headers were installed in /usr/include/mpp/. On the
  T3E, mpi is part of the "message passing toolkit," or MPT, which is
  part of the programming environment.  Thus, the mpi headers have
  moved from /usr/include/mpp/ over to /opt/ctl/.

  As with shmem.h, the module command sets up the environment so that
  the programmer doesn't need to worry about the location of these
  files.  Remember to load the MPT though, with this command:
    module load mpt

This presentation makes the situation seem more confusing than it really is. Except for the mpp system headers, the changes should be transparent, and the module command makes it trivial to switch from one version to another.

Our standard advice: 1) Don't use absolute paths to anything. 2) Get comfortable with the "module" command -- it is powerful and easy to use.

mppview

For now, the "ascii graphics," dots and words display of system activity, which was available through mppview on the Y-MP is unavailable on the T3E. For users with X displays, "xmppview" is a colorful 3-Dimensional replacement that beckons us to get virtual reality goggles and voyage inside the rotating torus.

Graphics aside, however, the ASCII info you really need is still available on the T3E through good old mppview. "mppview -s queue" tells you what's running and waiting; "mppview -s config" gives the PE configuration. For example:

yukon$ mppview -s queue


 
  ********** MPP Application Queue Stats **********
  
  Applications Running: 3
  
      UID  PPID     APID     Run Time  PEs  Base  Command
     ----- ----- ---------- -------- ----- ----- ----------
       162  7838 0x0186dc13 06:12:57    64     0 a.out.64
      1235 12886 0x0145f415 00:03:19    11    64 route
      1235 12924 0x01459c13 00:00:10     2    75 shmem_wait
  
  Applications Queued: 1
  
      UID  PPID    APID      Q Time    PEs  Command    Reason
     ----- ----- ---------- -------- ----- ---------- ----------
      1235 12909 0x0145bc13 00:02:07    10 iompp      ApLimit 
  
  
  yukon$ mppview -s config
  
  ********** PE Configuration **********
  
  Total PEs configured: 96
  
  OS PEs configured: 3
    LPE#  Name             MB   MHz
    ----- ---------------- ---- ----
    0x058 ospe_b            128  300
    0x059 ospe_c            128  300
    0x05f ospe_a            128  300
  
  Command PEs configured: 9
    LPE#  MB   MHz 
    ----- ---- ----
    0x054  128  300
    0x055  128  300
    0x056  128  300
    0x057  128  300
    0x05a  128  300
    0x05b  128  300
    0x05c  128  300
    0x05d  128  300
    0x05e  128  300
  
  Application PEs configured: 84
  Application PE types found: 1
         MB    MHz  #PEs
        ----- ----- -----
     1)   128   300    84
  
  Application regions configured: 1
        Min PEs Max PEs #In Use
        ------- ------- -------
     1)       2      84      77

T3D/T3E Differences

This will be a semi-regular feature, following the example of the "T3D/YMP Differences" list which appears in earlier issues of the T3D Newsletter. As we write up specific differences between the T3D and T3E, we'll append them to this list for quick reference; for a general overview available immediately, please see Newsletter #116. Whenever you find something, send it in so we can learn from each other's experience. The current list:

  1. Some system header files moved (Newsletter #117).
  2. mppview differences (Newsletter #117).

Quick-Tip Q & A


A: {{ In Cray's Programming Environment 2.0, how can you tell what versions
      of libraries, compilers, etc... will be used as the current default? }}

   # ARSC users may run the "PEvers" script.  Here is some sample output:

    yukon$ PEvers
    The following Programming Environment Packages are installed:
    =============================================================
    /opt/ctl/cf90
            2.0.3.0
            2.0.3.3
    The current default version is //opt/ctl/cf90/2.0.3.3.
    =============================================================
    /opt/ctl/CC
            2.0.3.0
            2.0.3.3
    The current default version is //opt/ctl/CC/2.0.3.3.
    =============================================================
    /opt/ctl/craytools
            2.0.3.0
            2.0.3.4
    The current default version is //opt/ctl/craytools/2.0.3.4.
    =============================================================
    etc... 
    etc...


   # The alternative is to take a peek manually at the PE 2.0 products:

   ls -l /opt/ctl        # Each directory listed is a product name.
   ls -l /opt/ctl/$PROD  # Where $PROD is a product name

   # In this second listing, each subdirectory is an installed version while  
   # the link identifies the default version. For example:

    yukon$ ls -l /opt/ctl
    total 96
    drwxr-xr-x   4 bin      bin   4096 Apr  1 19:15 CC
    drwxr-xr-x   3 bin      bin   4096 Mar 12 00:41 CCmathlib
    drwxr-xr-x   3 bin      bin   4096 Mar 12 00:44 CCtoollib
    drwxr-xr-x   6 bin      bin   4096 Mar 12 00:13 admin
    drwxr-xr-x   2 bin      bin   4096 Apr  1 19:15 bin
    drwxr-xr-x   3 bin      bin   4096 Mar 12 00:57 cam
    drwxr-xr-x   4 bin      bin   4096 Apr  1 19:17 cf90
    drwxr-xr-x   4 bin      bin   4096 Apr  1 19:12 craylibs
    drwxr-xr-x   4 bin      bin   4096 Apr  1 19:22 craytools
    drwxr-xr-x   3 bin      bin   4096 Mar 12 01:23 cvt
    drwxr-xr-x   2 bin      bin   4096 Apr  1 19:23 doc
    drwxr-xr-x   3 bin      bin   4096 Mar 12 00:13 mpt
    yukon$ ls -l /opt/ctl/cf90
    total 24
    drwxr-xr-x   6 bin      bin   4096 Mar 12 00:29 2.0.3.0
    drwxr-xr-x   6 bin      bin   4096 Apr  1 19:17 2.0.3.3
    lrwxrwxrwx   1 root     bin     22 Apr  1 19:17 cf90 ->  
                                               /opt/ctl/cf90/2.0.3.3

   # This shows that two versions of cf90 are installed: 2.0.3.0 and 
   # 2.0.3.3, and that the default is 2.0.3.3.
  

  Q: How can you capture a "man" page without getting all the formatting
     characters?  Say you wanted ASCII text for a newsletter.

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.