ARSC T3D Users' Newsletter 57, October 20, 1995

Sample NQS Jobs

As users move from interactive development to production runs, the NQS queues on the T3D are the way to go. At ARSC, the current NQS queues are:


  Queue name  Maximum  Maximum    Maximum number   When the
                PEs    duration   of jobs in       queue is
                                  this queue       active

  m_8pe_24h       8    24 hours       2        every day, 24 hrs
  m_16pe_24h     16    24 hours       2        every day, 24 hrs
  m_32pe_24h     32    24 hours       1        every day, 24 hrs
  m_64pe_10m     64    10 minutes     1        every day, 24 hrs
  m_64pe_24h     64    24 hours       1        every day, 24 hrs
  m_128pe_5m    128     5 minutes     1        every day, 24 hrs
  m_128pe_8h    128     8 hours       1        Fri 6PM-Sun 4AM

A typical NQS script for a T3D run might look like:

#QSUB -eo               #this job will write the error output
                        # and the user output to a single file
#QSUB -q mpp            #this job will use the T3D
#QSUB -l mpp_p=128      #this job will use 128 PEs
#QSUB -l mpp_t=250      #this job will run for 250 seconds
setenv TARGET cray-t3d
setenv MPP_NPES 128
cd /tmp/ess/128
a.out 
This job will enter the NQS queues and execute as soon as all 128 PEs are free. If the script file is called nqs.128, it is submitted with the command:

  qsub nqs.128
and it will produce, in the directory /tmp/ess/128, an output file called:

  nqs.128.o#####
where ##### is the NQS request identifier for this job. (This is the same number that identifies the job in the qstat display of all NQS jobs.) The same job could also have been submitted as:

  #QSUB -eo               #this job will write the error output
                          # and the user output to a single file
  #QSUB -q m_128pe_5m     #this job will use the 128 PE, 5 minute queue 
  setenv TARGET cray-t3d
  setenv MPP_NPES 128
  cd /tmp/ess/128
  a.out 
with the same effect. The man pages needed for running NQS jobs are listed below; a short usage sketch follows the list:

  qsub  - submits a job to the queues
  qstat - monitors jobs in the queues
  qdel  - deletes (kills) an NQS job
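For example, a typical session might look like the following (the request id 12345 is hypothetical; qsub reports the actual id when the job is submitted):

  qsub nqs.128       # submit the job to the queues
  qstat              # check the job's position in the queues
  qdel -k 12345      # delete the request, killing it if already running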
A more complicated NQS script might be a file called nqs.all:

  #QSUB -eo               #this job will write the error output
                          # and the user output to a single file
  #QSUB -q mpp            #this job will use the T3D
  #QSUB -l mpp_p=128      #the maximum number of PEs used will be 128
  #QSUB -l mpp_t=250      #the max. seconds for all T3D jobs of this script
  #QSUB -l p_mpp_t=30     #the max. seconds for each T3D job in this script
  setenv TARGET cray-t3d
  cd /tmp/ess/128
  make                     #this make happens on the Y-MP
  setenv MPP_NPES 128
  a.out > results.128      # collect results on 128 PEs (don't go to nqs.all.o#####)
  setenv MPP_NPES 64
  a.out > results.64       # collect results on 64 PEs (don't go to nqs.all.o#####)
  setenv MPP_NPES 32
  a.out > results.32       # collect results on 32 PEs (don't go to nqs.all.o#####)
  setenv MPP_NPES 16
  a.out > results.16       # collect results on 16 PEs (don't go to nqs.all.o#####)
  setenv MPP_NPES 8
  a.out > results.8        # collect results on 8 PEs (don't go to nqs.all.o#####)
  setenv MPP_NPES 4
  a.out > results.4        # collect results on 4 PEs (don't go to nqs.all.o#####)
  setenv MPP_NPES 2
  a.out > results.2        # collect results on 2 PEs (don't go to nqs.all.o#####)
  setenv MPP_NPES 1
  a.out > results.1        # collect results on 1 PE (don't go to nqs.all.o#####)
  echo 'all done'

Timing the Cache Management Functions

Using the SHMEM routines requires a certain amount of synchronization and cache management to make sure that the user gets the correct answer. But of course there is a cost to calling functions like shmem_udcflush and shmem_udcflush_line to do cache management. Below is the program I used to measure the times for these functions:

  #include <stdio.h>
  #include <mpp/shmem.h>      /* SHMEM declarations */

  main()
  {
    int MYPE,NPES;
    long space[128];          /* barrier work array */
    int i1,i2,i3,i4;
    fortran irtc();           /* Cray real-time clock, in clock periods */

    MYPE = _my_pe();
    NPES = _num_pes();

    shmem_barrier(0,1,NPES,space);
    if (MYPE == 0) {
      /* PE 0 sits out; every other PE times the flushes */
    } else {
      i1 = irtc();
         shmem_udcflush();             /* flush the entire data cache */
      i2 = irtc();
      i3 = irtc();                     /* back-to-back calls measure the */
      i4 = irtc();                     /* overhead of irtc() itself      */
      printf(" cachecost in clocks %d\n",(i2-i1) - (i4-i3));
      printf(" overhead in clocks %d\n",i4-i3);
      i1 = irtc();
         shmem_udcflush_line(space);   /* flush a single cache line */
      i2 = irtc();
      i3 = irtc();
      i4 = irtc();
      printf(" linecost in clocks %d\n",(i2-i1) - (i4-i3));
      printf(" lineover in clocks %d\n",i4-i3);
    }
    shmem_barrier(0,1,NPES,space);
  }
For the T3D, with a clock period (CP) of 6.67 nanoseconds (1/(150 MHz)), the program above gave the following timings over about 100 runs:

                       minimum (CPs)   maximum (CPs)   average (CPs)

  shmem_udcflush             503             578             530
  shmem_udcflush_line        256             307             270
The SHMEM technical note for C (Version 2.3) says that the approximate time should be 230 CPs for shmem_udcflush and 27 CPs for shmem_udcflush_line, so the measured costs are considerably higher than the documented ones (530 CPs is roughly 3.5 microseconds at 6.67 ns per CP). In the next newsletter, I will present some results on synchronizing SHMEM operations.

The Problem with SHMEM gets

In Newsletter #53 (09/22/95) there was a report about a bug in the SHMEM_GET routines in the 1.2 Programming Environment (PE). It is disconcerting that shmem_get has this problem, because shmem_put is usually considered the more dangerous function: it is shmem_put that requires programmer-assured synchronization. Below is a replication of the shmem_get bug on the ARSC T3D, using calls to the C shmem_get function:

  #include <stdio.h>
  #include <mpp/shmem.h>      /* SHMEM declarations */

  main()
  {
    int i,NPES,MYPE;
    long space[128];          /* barrier work array */
    double a[8];
    double b[8];
  #pragma _CRI cache_align a,b

    NPES = _num_pes();
    MYPE = _my_pe();

    /* each PE fills b[] with its own PE number; a[] starts at -1.0 */
    for (i = 0; i < 8; i++) {
      a[i] = -1.0;
      b[i] = _my_pe();
    }
    shmem_barrier(0,1,NPES,space);
    if (MYPE == 0) {
  /* case 1: get 4 words from PE 1, then 2 words from PE 2,
             both starting at b[3] on the remote PE */
      shmem_get(a,&b[3],4,1);
      shmem_get(a,&b[3],2,2);
      printf(" a[0] = %f\n", a[0]);
      printf(" a[1] = %f\n", a[1]);
      printf(" a[2] = %f\n", a[2]);
      printf(" a[3] = %f\n", a[3]);
  /* case 2: get 4 words from PE 1, then 3 words from PE 2 */
      shmem_get(a,&b[2],4,1);
      shmem_get(a,&b[2],3,2);
      printf(" a[0] = %f\n", a[0]);
      printf(" a[1] = %f\n", a[1]);
      printf(" a[2] = %f\n", a[2]);
      printf(" a[3] = %f\n", a[3]);
  /* case 3: same as case 2, but flush the data cache between the gets */
      shmem_get(a,&b[2],4,1);
      shmem_udcflush();       /* invalidate the local data cache */
      shmem_get(a,&b[2],3,2);
      printf(" a[0] = %f\n", a[0]);
      printf(" a[1] = %f\n", a[1]);
      printf(" a[2] = %f\n", a[2]);
      printf(" a[3] = %f\n", a[3]);
      printf(" a[4] = %f\n", a[4]);
      printf(" a[5] = %f\n", a[5]);
    } else {
      /* all other PEs just wait at the barriers */
    }
    shmem_barrier(0,1,NPES,space);
  }
The output from this test program is:

  case 1
   a[0] =  2.000000
   a[1] =  1.000000  should be 2.000000
   a[2] =  1.000000
   a[3] =  1.000000
  case 2
   a[0] =  2.000000
   a[1] =  2.000000
   a[2] =  1.000000  should be 2.000000
   a[3] =  1.000000
  case 3 (correct because of the shmem_udcflush() function
          between successive calls)
   a[0] =  2.000000
   a[1] =  2.000000
   a[2] =  2.000000
   a[3] =  1.000000
   a[4] = -1.000000
   a[5] = -1.000000
I have experimented with different uses of shmem_get, and it takes a combination of conditions for this error to appear:
  1. consecutive calls to shmem_get,
  2. the gets read from different PEs,
  3. the gets target the same local address,
  4. the gets transfer 2 or 3 words of data, and
  5. the gets span a 4-word cache line.
The code for case 3 above shows a simple fix for these consecutive calls: a shmem_udcflush() between them; the sketch below wraps that fix into a helper. At ARSC, we will try to upgrade to the 1.2.2.3 PE as soon as the UNICOS upgrade is finished.
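This is only a sketch: the name safe_shmem_get is mine, not a library routine, and at roughly 530 CPs per flush (from the timings above) the safety is not free:

  /* Hypothetical wrapper: flush the local data cache before each get so
     that consecutive gets touching the same cache line see fresh data. */
  void safe_shmem_get(long *target, long *source, int len, int pe)
  {
    shmem_udcflush();                    /* invalidate the local data cache */
    shmem_get(target, source, len, pe);  /* then do the actual get */
  }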

How are SHMEM PUTS Implemented?

The timings in last week's newsletter showed that shmem_puts are not totally non-blocking operations. One reader wrote in to comment that CRI originally planned to implement shmem_put with the BLT (Block Transfer Engine) hardware, which would have made it fully non-blocking. It seems that the actual implementation uses the processor, not the BLT, and therefore shmem_puts are not fully non-blocking (while still requiring programmer-assured synchronization). Does anyone know more about how the shmem_puts are implemented? I would like to pin down just how non-blocking the operation really is; one way to probe it is sketched below.
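A minimal probe of this, reusing irtc() from the timing program above and assuming shmem_quiet(), which waits for outstanding puts to complete: if shmem_put were fully non-blocking, the initiation time i2-i1 should stay nearly constant as nwords grows, with the transfer cost showing up in i3-i2.

  int i1,i2,i3;
  int nwords = 512;                      /* vary this to see the trend */
  long target[1024], source[1024];       /* symmetric on all PEs */
  fortran irtc();

  i1 = irtc();
  shmem_put(target, source, nwords, 1);  /* initiate the transfer to PE 1 */
  i2 = irtc();
  shmem_quiet();                         /* wait for the put to complete */
  i3 = irtc();
  printf(" initiate %d  complete %d\n", i2-i1, i3-i2);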

Reading and Writing IEEE and Y-MP Files on the Y-MP and T3D

I have had two requests in the past week about reading Y-MP files on the T3D and reading T3D files on the Y-MP. If you have examples of either, I would like to see them; I will put together a set of examples for the next newsletter. Thanks.

List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
  8. Restrictions on record length for direct I/O files (Newsletter #19)
  9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
  10. Missing Linpack and Eispack routines in libsci (Newsletter #25)
  11. F90 manual for Y-MP, no manual for T3D (Newsletter #31)
  12. RANF() and its manpage differ between machines (Newsletter #37)
  13. CRAY2IEG is available only on the Y-MP (Newsletter #40)
  14. Missing sort routines on the T3D (Newsletter #41)
  15. Missing compiler allocation flags (Newsletter #52)
  16. Missing compiler listing flags (Newsletter #53)
I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.