ARSC T3D Users' Newsletter 106, September 27, 1996

T3D/T3E Timings

[ One of our T3D users, Dr. Alan Wallcraft of Stennis Space Center, contributes this article. ]

I have a couple of ocean model benchmark codes which Cray ran on both the T3D and T3E. The original used Fortran 77 and PVM message passing. Vendors are allowed to modify the code, but I don't know to what extent Cray did so. One change they made was to use the Fortran 90 compiler and REAL*4 throughout, since ocean models typically run with 32-bit REALs. The benchmarks use the same basic code at two sizes: (i) a small 1/2 degree global ocean case, and (ii) a much larger 1/16 degree global case.

Since the T3E typically has twice as much memory per node as the T3D, Cray ran on half as many nodes on the T3E. This makes direct node-for-node comparisons trickier. But even assuming only a 1.8x speedup when doubling the number of nodes (which is about the minimum expected for this kind of code), a T3E node is about 3.2x to 3.3x faster than a T3D node. This is a very good speedup, particularly since the compiler is presumably still better tuned for the T3D than the T3E (i.e., the T3E compiler will probably improve as it matures). Peak T3E hardware performance is 4x the T3D (600 Mflops vs 150 Mflops).


 
                  #T3D    WALL   #T3E   WALL   SPEED-UP   SPEED-UP
                  NODES   TIME   NODES  TIME   WALL TIME  PER NODE
 
      OCEANS-02     32     440     16    249     1.77x      3.2x
 
      OCEANS-16    128    1350     64    734     1.84x      3.3x
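
The per-node column can be reproduced from the wall times. The short Python sketch below is only an illustration of that arithmetic: the names RUNS, DOUBLING_SPEEDUP and speedups are made up for this example, and the 1.8x doubling factor is the conservative assumption stated in the text, not a measured value.

```python
# Benchmark figures copied from the table above:
# name: (T3D nodes, T3D wall time, T3E nodes, T3E wall time)
RUNS = {
    "OCEANS-02": (32, 440.0, 16, 249.0),
    "OCEANS-16": (128, 1350.0, 64, 734.0),
}

# Assumed speedup from doubling the node count (the article's conservative figure).
DOUBLING_SPEEDUP = 1.8


def speedups(t3d_nodes, t3d_time, t3e_nodes, t3e_time, doubling=DOUBLING_SPEEDUP):
    """Return (wall-time speedup, estimated per-node speedup)."""
    # This estimate only applies when the T3E run used half as many nodes.
    if t3d_nodes != 2 * t3e_nodes:
        raise ValueError("expected the T3E run to use half the T3D node count")
    wall = t3d_time / t3e_time
    # Credit the T3E with what a doubling of its node count would have bought.
    return wall, wall * doubling


for name, run in RUNS.items():
    wall, per_node = speedups(*run)
    print(f"{name}: wall time {wall:.2f}x, per node {per_node:.1f}x")
# OCEANS-02: wall time 1.77x, per node 3.2x
# OCEANS-16: wall time 1.84x, per node 3.3x
```

With perfect scaling (a doubling factor of 2.0) the same wall times would give somewhat higher per-node figures, so 1.8 gives the lower end of the estimate.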

Barriers Revisited

I have found that my intuition regarding the TEST_BARRIER, SET_BARRIER, and WAIT_BARRIER functions was off (see the next article). I've done some more work and come up with a better understanding.

By way of introduction, what do you think the following program will do if compiled for four PEs and run on the T3D?


ccccccccccccccccccccccccccccccc
      program wait1
      intrinsic MY_PE
      
      call WAIT_BARRIER ()
      print*, MY_PE(), " done."
      end
ccccccc

It terminates normally:


  denali$ a.out
   0 done.
   2 done.
   3 done.
   1 done.

How about this?


ccccccccccccccccccccccccccccccc
      program wait2
      intrinsic MY_PE
      logical TEST_BARRIER
      
      print*, MY_PE(), " Before: TEST_BARRIER()= ", TEST_BARRIER()
                            
      if (MY_PE() .EQ. 0) then 
        call SET_BARRIER ()
      else
        call sleep (5)
        call WAIT_BARRIER ()
      endif
      
      print*, MY_PE(), " After: TEST_BARRIER()= ", TEST_BARRIER()
      end
ccccccc

It terminates normally, with this output:


  denali$ a.out
   1 Before: TEST_BARRIER()= T
   3 Before: TEST_BARRIER()= T
   2 Before: TEST_BARRIER()= T
   0 Before: TEST_BARRIER()= T
   0 After: TEST_BARRIER()= F
   1 After: TEST_BARRIER()= T
   3 After: TEST_BARRIER()= T
   2 After: TEST_BARRIER()= T

And this?


ccccccccccccccccccccccccccccccc
      program barr1
      intrinsic MY_PE
      
      if (MY_PE() .EQ. 0) then 
        call sleep (5)
        call barrier ()
        call barrier ()
      else
        call SET_BARRIER ()
        call SET_BARRIER ()
        call SET_BARRIER ()
        call SET_BARRIER ()
        call barrier ()
      endif
      
      print*, MY_PE(), " done."
      end
ccccccc

It hangs on the second barrier call in PE0, and must be interrupted:


  denali$ a.out
   2 done.
   1 done.
   3 done.
  Interrupt
  
   Beginning of Traceback (PE 0):
    Started from address 0x20000c0514 in routine '_sma_deadlock_wait'.
    Called from line 78 (address 0x20000c0720) in routine 'barrier'.
    Called from line 8 (address 0x20000001a4) in routine 'WAIT'.
    Called from line 363 (address 0x2000003e08) in routine '$START$'.
   End of Traceback.

---

I wrote these little programs to demonstrate specific features of the barrier calls. Here are some general observations followed by a description of barrier states and transitions.

Observations:

  • Setting a barrier on one PE does NOT set it on any other PEs.
  • The SET_BARRIER() call sets the local barrier unless all of the other PEs have already set their barriers, in which case it clears the barriers on all PEs at once. On the T3D, this takes place in hardware and is extremely fast, regardless of the number of PEs involved.
  • While a barrier is set, extra calls to SET_BARRIER() have no effect and are not enqueued for later.
  • Calling TEST_BARRIER() or WAIT_BARRIER() has no effect on the state of the barrier.
  • The basic "BARRIER()" call is probably safer and easier to use than SET, WAIT, and TEST.

STATE DESCRIPTION FOR T3D BARRIER FUNCTIONS FROM THE POINT OF VIEW OF THE LOCAL PE:


-------------------------------------------------------

The local barrier can be in one of two states:

  1. CLEAR: in which a "WAIT_BARRIER()" call will NOT block and a "TEST_BARRIER()" call will return TRUE.
  2. SET: in which a "WAIT_BARRIER()" call WILL block and a "TEST_BARRIER()" call will return FALSE.

Start state:
  CLEAR:     all programs start with all PE processes in the barrier
             state, CLEAR.


Transitions:
  1) CLEAR ==> CLEAR: any other (i.e., non-local) PE calls
                  "SET_BARRIER()."

  2) CLEAR ==> SET: occurs when the local PE calls "SET_BARRIER()."
                    
  3) SET ==> SET: any other PE, except for the last remaining PE, calls 
                  "SET_BARRIER()."
 
  4) SET ==> SET: any other PE which has already called "SET_BARRIER()"
                  calls it again.

  5) SET ==> SET: the local PE calls "SET_BARRIER()" again.

  6) SET ==> CLEAR: occurs when the last remaining PE calls "SET_BARRIER()".
                  This could be the local PE, in which case, the transitions
                  from CLEAR ==> SET and back from SET ==> CLEAR happen
                  atomically.

-------------------------------------------------------
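
The states and transitions above can be mimicked in a few lines of Python. This is only a toy model, not T3D code: the class BarrierModel and its method names are invented for illustration, everything runs sequentially on one processor, and it assumes that BARRIER() behaves like SET_BARRIER() followed by WAIT_BARRIER(), which is consistent with the runs shown earlier but is not taken from Cray documentation.

```python
class BarrierModel:
    """Toy model of the SET/TEST/WAIT barrier states on npes PEs."""

    def __init__(self, npes):
        # Start state: every PE is CLEAR (False = CLEAR, True = SET).
        self.is_set = [False] * npes

    def set_barrier(self, pe):
        # Transitions 2 and 5: the local barrier becomes, or stays, SET.
        self.is_set[pe] = True
        # Transition 6: when the last remaining PE sets, all PEs clear at once.
        if all(self.is_set):
            self.is_set = [False] * len(self.is_set)

    def test_barrier(self, pe):
        # TRUE when the local barrier is CLEAR; does not change the state.
        return not self.is_set[pe]

    def wait_would_block(self, pe):
        # WAIT_BARRIER() blocks only while the local barrier is SET.
        return self.is_set[pe]


# Replay of program wait2: only PE 0 sets its barrier.
m = BarrierModel(4)
m.set_barrier(0)
wait2_after = [m.test_barrier(pe) for pe in range(4)]
print(wait2_after)  # [False, True, True, True]

# Replay of program barr1 (assumed: BARRIER() = set, then wait).
m = BarrierModel(4)
for pe in (1, 2, 3):
    for _ in range(5):      # four SET_BARRIER() calls plus the set in BARRIER()
        m.set_barrier(pe)   # the repeats change nothing (transition 5)
m.set_barrier(0)            # PE 0, first BARRIER(): last PE in, all clear
barr1_released = [not m.wait_would_block(pe) for pe in range(4)]
m.set_barrier(0)            # PE 0, second BARRIER(): no other PE sets again
barr1_pe0_hangs = m.wait_would_block(0)
print(barr1_released, barr1_pe0_hangs)  # [True, True, True, True] True
```

The model matches the TEST_BARRIER() values printed by wait2 and the hang on PE 0's second barrier() call in barr1. It says nothing about timing, since the real barrier is implemented in hardware.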

For anyone who is interested, here is my main testing program. It lets you set barriers on different PEs at different times, and overlap barrier segments, as you wish.


#######################################################
      Program barrier_tests
      implicit none
      integer MY_PE, tnum, nspins
      real dummy
      intrinsic MY_PE
      logical TEST_BARRIER, tb
      character*35 barStat
       
      dummy = 0

      if (N$PES .NE. 4) then 
        stop "NPES must equal 4."
      endif

      call barrier ()

      do tnum = 1, 36
        call spin (dummy, 1)

        if (MY_PE() .EQ. 0) then 
          if (tnum .EQ. 8) call SET_BARRIER ()
          if (tnum .EQ. 16) call WAIT_BARRIER ()
          if (tnum .EQ. 28) call SET_BARRIER ()

          write (6, 1010) MY_PE(), tnum, " barrier is: ", barStat ()
          call flush (6)

        else if (MY_PE() .EQ. 1) then 
          if (tnum .EQ. 8) call SET_BARRIER ()
          if (tnum .EQ. 16) call WAIT_BARRIER ()
          if (tnum .EQ. 24) call SET_BARRIER ()

          write (6, 1010) MY_PE(), tnum, " barrier is: ", barStat ()
          call flush (6)

        else if (MY_PE() .EQ. 2) then 
          if (tnum .EQ. 8) call SET_BARRIER ()
          if (tnum .EQ. 12) call WAIT_BARRIER ()
          if (tnum .EQ. 24) call SET_BARRIER ()

          if (tnum .EQ. 30) call SET_BARRIER ()

          write (6, 1010) MY_PE(), tnum, " barrier is: ", barStat ()
          call flush (6)

        else if (MY_PE() .EQ. 3) then 

          if (tnum .EQ. 4) call SET_BARRIER ()
          if (tnum .EQ. 10) call WAIT_BARRIER ()
          if (tnum .EQ. 20) call SET_BARRIER () 

          ! Does not block because set (on this PE) not called first
          if (tnum .EQ. 32) call WAIT_BARRIER ()  

          if (tnum .EQ. 34) call SET_BARRIER ()

          ! Blocks because of prior call to set
          if (tnum .EQ. 35) call WAIT_BARRIER ()  

          write (6, 1010) MY_PE(), tnum, " barrier is: ", barStat ()
          call flush (6)

        endif
      enddo
      
      write (6, 1000) "PE", MY_PE(), " DONE" 
      call flush (6)

1000  format (a,i3,a)
1010  format (i3,i3,a,a)
      end
ccccccccccccccccccccccccccccccccccccc
      character*35 function barStat ()
        implicit none
        logical TEST_BARRIER

        if (TEST_BARRIER()) then
          barStat = "CLEAR "
        else
          barStat = "SET - WAIT_BARRIER() would block"
        endif
      end
ccccccccccccccccccccccccccccccccccccc
      subroutine spin (dummy, nspins)
        implicit none
        integer i, spinnum, nspins
        real dummy, ranf
        intrinsic ranf

        do spinnum = 1, nspins
          do i = 1,100000
             dummy = dummy + ranf () 
          enddo
        enddo
      end
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc

Corrected Barrier Timing Program

In last week's Newsletter, I presented a program to do a timing comparison between eurekas and barriers. As noted, the program occasionally hangs. I speculated: "I think the problem is that the barrier calls are too close together, and once in a while a 'set' goes undetected at some process' 'wait.'" Sure enough, as explained in the preceding article, extra SET_BARRIER() calls have no effect, and thus I had introduced a synchronization bug into the program.

Frank Chism of CRI rightfully called me on this, and sent an email with a debugged version of the program. Here is part of Frank's message and the program:

> 
> Use of the fully asynchronous barrier routines can introduce very subtle
> bugs if you are not careful to be sure that all branches fully satisfy
> not only the setting of barriers but properly wait for them to complete
> before allowing termination.  In your case you made sure that all PEs
> set barriers when they should,  but because you did not wait for all of
> them to complete it was possible for some PEs to terminate before the
> last barrier was satisfied.
> 
> Frank
> 
> -------------------------------------------------------------------
> Here is a diff between the original test you published and the corrected
> one I ran.  Also I have included the entire working test.
> 
> -------------------------------------------------------------------------
> 
> denali% diff eb.f eb.org
> 42,43d41
> < Chism debug: Note the wait barrier() part of this barrier is not
> < Chism        matched in the else branch of this if
> 54,57d51
> < Chism debug: barrier is set, but no matching wait_barrier for full
> < Chism        barrier call in the other branch of the if
> <           call wait_barrier()
> < Chism end debug
> 94,99c88
> < Chism debug: A call to set_barrier without a matching wait_barrier
> < Chism        does not match the other branches set_barrier barrier
> < Chism        combination.  I'll match a barrier with a barrier
> < Chism     call set_barrier ()
> <           call barrier()
> < Chism end debug
> ---
> >           call set_barrier ()
> 103,105d91
> < Chism Debug barrier problem at termination
> < Chism stop
> < Chism end debug
> denali% cat eb.f
>       Program barrier_timings
> 
>       implicit none 
>       integer trigger_PE         ! Which PE will now trigger event
>       integer mc(128)            ! Array to store system info 
>       integer MY_PE              ! Intrinsic function to get PE number 
>       integer mem_event          ! Shared variable for memory-mode event
>       real t1                    ! Temporary storage of start times 
>       real t2                    ! Temporary storage of end times 
>       real junk
>       real delay_start           ! For simulated work, start of spin
>       real irtc                  ! Internal function, clock ticks 
>       real cp                    ! Clock period in secs
>       logical test_event         ! Internal function
>       logical test_barrier       ! Internal function
>       intrinsic MY_PE
> cdir$ shared mem_event
> 
>       call gethmc (mc)
>       cp = mc(7) * 1.0e-12      ! convert picosecs to secs.
> 
> c
> c     Time event propagation when using eureka-mode events
> c
>       if (MY_PE() .EQ. N$PES-1) then 
>         write (6,1000) "EUREKA-MODE "
>         call flush (6)
>       endif
> 
>       do trigger_PE = 0, N$PES - 1
>         call clear_event ()        ! In Eureka mode, all PEs must clear
> 
>         call barrier ()  ! Make sure all PEs ready to watch for event
>         if (MY_PE() .EQ. trigger_PE) then
> 
>           ! Kill .1 secs to simulate some work
>           delay_start = irtc ()
> 5         if (irtc() .LT. delay_start + 0.1 / cp) goto 5
> 
>           t1 = irtc ()
>           call set_event ()         ! Trigger event
> Chism debug: Note the wait barrier() part of this barrier is not
> Chism        matched in the else branch of this if
>           call barrier ()           ! Wait till all PEs detect event
>           t2 = irtc ()
> 
>           write (6, 1010) MY_PE(), (t2-t1) * cp * 1e6
>           call flush (6)
>         else
> 10        if (.NOT. test_event()) goto 10
> 
>           ! Inform triggering PE that 1st barrier release was detected
>           call set_barrier ()
> Chism debug: barrier is set, but no matching wait_barrier for full
> Chism        barrier call in the other branch of the if
>           call wait_barrier()
> Chism end debug
>         endif
>       enddo  
> 
> 
> c
> c     Now use barrier
> c
> 
> 
>       if (MY_PE() .EQ. N$PES-1) then 
>         write (6,1000) "BARRIER"
>         call flush (6)
>       endif
> 
>       do trigger_PE = 0, N$PES - 1
> 
>         call barrier ()
>         if (MY_PE() .EQ. trigger_PE) then
> 
>           delay_start = irtc ()
> 105       if (irtc() .LT. delay_start + 0.1 / cp) goto 105
> 
>           t1 = irtc ()
>           call set_barrier ()      ! Trigger release of barrier
>           call barrier ()          ! Wait till all PEs detect release
>           t2 = irtc ()
> 
>           write (6, 1010) MY_PE(), (t2-t1) * cp * 1e6
>           call flush (6)
>         else
>           call set_barrier ()      ! All non-trigger PEs pass barrier
> 
>           ! Spin until trigger PE does its set_barrier
> 110       if (.NOT. test_barrier()) goto 110
>           
>           ! Inform triggering PE that 1st barrier release was detected
> Chism debug: A call to set_barrier without a matching wait_barrier
> Chism        does not match the other branches set_barrier barrier
> Chism        combination.  I'll match a barrier with a barrier
> Chism     call set_barrier ()
>           call barrier()
> Chism end debug
>         endif
>       enddo  
> 
> Chism Debug barrier problem at termination
> Chism stop
> Chism end debug
>       
> 1000  format (a,/,"Event_PE ", " Delay(usecs)")
> 1010  format (i4, "       ", f6.2)
>       end

This version produces output similar to that presented last week:


    EUREKA-MODE 
    Event_PE  Delay(usecs)
       0         9.25
       1         9.15
       2        10.10
       3         9.16
       4         9.12
       5         9.05
       6         9.07
       7         9.38
    BARRIER
    Event_PE  Delay(usecs)
       0         6.83
       1         6.60
       2         7.97
       3         7.81
       4         7.08
       5         7.30
       6         6.44
       7         6.67

Quick-Tip Q & A


A: {{ If you work at computers all day (for years), how can you reduce 
      eye-strain?  }}

    # When pressed, ARSC staff were happy to respond.  Some tips:

    - Every 5-60 minutes, look up from the screen and focus (for 
      several seconds) on something distant.
    - Use dark "wallpaper." On text screens, use dark background
      with bright foreground.  Here's a suggested xterm setting:
            WINTERM='xwsh -name winterm -fn 
              -*-screen-bold-r-normal--18-*-*-*-m-100-iso8859-1 
              -bg black -fg white -bold cyan'.
    - Use large fonts and sit further back.
    - Blink often.
    - Use eye-drops.
    - Roll your eyes.
    - Get a monitor lens. This is from a Cornell study (available at
      http://www.news.cornell.edu/Chronicles/5.2.96/filters.html):

        "After using a glass anti-glare filter, the percentage of daily
        or weekly problems related to lethargy/tiredness, tired eyes,
        trouble focusing eyes, itching/watery eyes and dry eyes was
        half what they were before filter use for people who use
        computer monitors all day at work, said ergonomist Alan Hedge,
        professor of design and environmental analysis and director of
        the Human Factors Laboratory at Cornell."


Q: What's an easy way to remove all the 'core' and 'mppcore' files in
   any of your directories (but have 'rm' ask before removing)?

[ Answers, questions, and tips graciously accepted. ]
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.