ARSC HPC Users' Newsletter 283, December 19, 2003

Iceberg Status Update

As announced earlier in

/arsc/support/news/hpcnews/hpcnews265/index.xml and /arsc/support/news/hpcnews/hpcnews279/index.xml,

ARSC is installing a large IBM p655+/p690+ cluster, "iceberg," with IBM's new Federation Switch technology for the server interconnect. The switch was successfully installed early this month and is now being tested by ARSC staff.

A schedule for pioneer user access will be announced later.

A "Universal" High Performance Code?

[ Thanks to Jeff McAllister for this article and code. ]

Much is said about the benefits of one architecture over another. However, from the standpoint of writing code, extensive optimization for a specific machine is often not the best use of effort.

I hoped to find some concepts which could endure beyond a product lifecycle. I set out to write a simple, portable, distributed-memory multiprocessor code, free of machine-specific optimizations yet able to achieve near-peak performance. For inspiration I started with Guy Robinson's hard-to-beat "gflop" code from HPC Newsletter 213.

What I ended up with is an MPI midpoint-rule integral solver; the code is included below. The function this version integrates is y=.5x+1 from 0 to 2000, so the result should be very close to 1002000 (analytically, 0.25*2000**2 + 2000 = 1002000), regardless of the number of processors used. (Integrals make nice performance tests because they can generate a lot of work, each step can be done independently, and the results are easily predictable.)

Since I was looking for operations per second, I manually counted 12 adds/multiplies per loop iteration, including the increment of the loop counter (a per-statement breakdown appears after the list below). Compiler optimizations such as unrolling will change the actual instruction count, so the compute rates the code reports should be treated as estimates. In practice, though, the results are usually similar to the CPU counter figures reported by the various vendor tools; see:

Cray X1 "pat_hwpc", below; IBM's "hpmcount", HPC newsletter 251 ; Cray PVP "hpm", HPC newsletter 207 ; Cray MPP "pat", T3E newsletter 172 .

Timing, always a sticky issue, is handled with MPI_Wtime. This might not always provide the best granularity possible, but it's portable.
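
If granularity is a concern on a particular platform, MPI also provides the MPI_Wtick function, which returns the resolution of MPI_Wtime in seconds. A minimal sketch of such a check:


program check_timer
  implicit none
  include 'mpif.h'
  integer :: ierr

  call mpi_init(ierr)
  ! mpi_wtick returns the resolution of mpi_wtime, in seconds
  print *, 'MPI_Wtime resolution (s): ', mpi_wtick()
  call mpi_finalize(ierr)
end program check_timer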

As you can see, the attempt at universal performance had mixed results:


                                  1 CPU MFLOPS
                          -----------------------------------
system    CPU             achieved   peak   % of peak achieved
======    =============   ========  =====   ==================
iceflyer  1.7 GHz Power4     6066    6800       89
klondike  X1 MSP             8248   12800       64              
klondike  X1 SSP             2255    3200       70
chilkoot  500 MHz SV1ex      1513    2000       76
yukon     450 MHz DEC         678     900       75
ambler    300 MHz R12k        493     600       82
quest     333 MHz Pent II     144     333       43

Consistent high performance is elusive. The code is quite portable and should run wherever Fortran 90 and MPI are available. I/O and memory access are not bottlenecks: on vector machines the code vectorizes perfectly, and on cache machines it has about as much locality as you can get. The number of memory locations needed is so small that the variables should stay in CPU registers without ever touching even L1 cache. Even so, peak is still far away on some platforms.

Compiler options may help. On the Power4 machines, for example, this code runs almost twice as fast when compiled with -O4 as with -O3. The -O5 option is not best in this case. And on the T3E, performance improved from 207 to 678 MFLOPS with: "ftn -Oscalar3,aggress,bl,pipeline3,split2,unroll2 integral_mpi.f90".  (My default on all the other platforms is -O3.) More time with the compiler options may lead to similar improvement, especially in the cases farthest from peak.

However, I'm not convinced it will be so easy. While originally developed for our Cray X1, this code's performance showed a definite spike when moved to the IBM systems. (For other codes, the reverse could just as easily be true.) Probably the main reason this program does so well on the Power4 chips has to do with a lucky match between the architecture and the algorithm. As this is a midpoint integral solver, the main kernel has a lot of multiplies and adds in succession:


   do i=0,nsteps
     x1=(i*interval)+a1
     x2=((i+1)*interval)+a1
     xmid=(x1+x2)*.5
     y=(xmid*.5)+1.0
     sum1=sum1+(y*interval)
   end do

Fused multiply-add (FMA) just happens to be a single hardware operation on the Power4: when the compiler can express the code in terms of this instruction, two floating-point operations complete in one cycle per FMA unit. (With two FMA units, the 1.7 GHz Power4 peaks at 4 flops per cycle, which is where the 6800 MFLOPS figure in the table comes from.)

Clearly, some algorithms are a better fundamental match to some architectures than others. Guy Robinson's original "gflop" still gets closer to peak on the Cray vector systems, though this code performs better on the IBMs (and, presumably, on a wider variety of MPP and vector systems). As another argument in this code's favor, it could be more easily rewritten to match the special features of any platform, possibly by just changing the function it integrates.
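
For instance, only the marked section of the kernel needs to change. A sketch (the cubic below is purely illustrative; the analytic result used to check the answer and the ops_per_loop count would both need updating to match):


      !-----------------------
      ! function integrated
      !-----------------------
      ! original:  y = (xmid*.5) + 1.0
      ! an illustrative heavier alternative, a cubic in Horner form:
      y = ((0.3*xmid + 0.5)*xmid + 1.0)*xmid + 2.0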

The search for a universal strategy to demonstrate and achieve high performance continues. Fortunately, just as there is a wide variety of codes, there is a wide variety of machines to run them on.

Here is the program:


program integral
  implicit none
  include 'mpif.h'
  
  !-----------------------------------------
  ! declare variables
  !-----------------------------------------
  integer::nsteps,i,nparts,part,master,mype,ierr,totpes,ops_per_loop
  real(kind=8)::sum1,interval,x1,x2,xmid,y,a,b,area,a1,b1,fullsum
  real(kind=4)::sumbuf
  real(kind=4),allocatable,dimension(:)::psum
  integer,allocatable,dimension(:)::pstart,pend
  double precision::time,time1, time2,total_ops,total_loops
  
  !-----------------------------------------
  ! initialize MPI
  !-----------------------------------------
  master=0 
  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world, mype, ierr)
  call mpi_comm_size(mpi_comm_world, totpes, ierr)
  
  
  !-----------------------------------------
  ! partition the integration steps
  !-----------------------------------------
  nparts=totpes
  part=mype  
  allocate(pstart(0:nparts-1),pend(0:nparts-1))
  
  interval=.000001
  a=0.0
  b=2000.0
  
  call partitiontable(int(b-a),nparts,pstart,pend)
  a1=pstart(part)+a-1
  b1=pend(part)+a
  nsteps=((b1-a1)/interval)
  
  
  !-----------------------------------------
  ! KERNEL
  ! each proc calculates the integral
  ! from a1 to b1 -- only this is timed
  !-----------------------------------------
  sum1=0
  ops_per_loop=12
  time1=mpi_wtime()
  do i=0,nsteps
     x1=(i*interval)+a1
     x2=((i+1)*interval)+a1
     xmid=(x1+x2)*.5
     
     !-----------------------
     ! function integrated
     !-----------------------
     y=(xmid*.5)+1.0
     
     sum1=sum1+(y*interval)
  end do
  time2=mpi_wtime()
  !-----------------------------------------
  ! END KERNEL
  !-----------------------------------------
  
  

  !-----------------------------------------
  ! partial sums are gathered to master
  !-----------------------------------------
  if (mype==master) allocate(psum(totpes))
  sumbuf=sum1
  call mpi_gather(sumbuf,1,mpi_real,psum,1,mpi_real,master,mpi_comm_world,ierr)
  
  
  
  !-----------------------------------------
  ! master prints results
  !-----------------------------------------
  if (mype==master) then
     print ("(A16,x,F15.3)"),"result:",sum(psum)
     
     total_loops=(b-a)/interval
     total_ops=total_loops*ops_per_loop
     
     print ("(A16,x,F15.0)"),"total ops:",total_ops 
     
     time=time2-time1
     print ("(A16,x,F15.1)"),"elapsed time:",time
     print ("(A16,x,F15.1)"),"MFLOPS:",total_ops/time/1000000.0
     
  end if
  
  call mpi_finalize(ierr)
  
end program integral


! ----------------------------------------------
! This is a generic subroutine to generate a
! partition table over the range 1:MAX.
! There are probably shorter ways to do this
! (one alternative is sketched after this listing).
! ----------------------------------------------  
SUBROUTINE partitiontable(MAX,totalPEs,Pstart,Pend)
  IMPLICIT NONE

  ! -------- SCALAR VARIABLES -------
  INTEGER MAX 
  INTEGER totalPEs 
  INTEGER I
  INTEGER DIFF1,DIFF2  
  INTEGER LastOne  
  REAL LOOP_INC 
  REAL POS   
  ! -------- DIMENSIONED VARIABLES-----
  INTEGER Pstart(0:totalPEs-1)
  INTEGER Pend(0:totalPEs-1)


  ! Set up the partition increments
  pos=0
  loop_inc=real(MAX-1)/real(totalPEs)
  LastOne=totalPEs-1
  

  ! Create the initial partition table    
  DO I=0,LastOne
     pos=pos+1.
     Pstart(I)=pos
     pos=pos+loop_inc-1
     Pend(I)=pos
  ENDDO

  ! Correct for imbalances caused by truncation of division
  ! for the increments by widening the first partition by one
  
  IF (totalpes.ge.2) THEN
     Pend(0)=Pend(0)+1
     DO I=1,LastOne
        Pstart(i)=Pstart(i)+1
        Pend(i)=Pend(i)+1
     ENDdo
  ENDif

  if (Pend(LastOne).ne.MAX)  Pend(LastOne)=MAX
  
  
  RETURN
END SUBROUTINE partitiontable
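
As the comment in the subroutine suggests, shorter formulations exist. Here is one untested sketch, using integer arithmetic to produce near-equal contiguous blocks covering 1:MAX with no correction pass:


SUBROUTINE partitiontable2(MAX,totalPEs,Pstart,Pend)
  IMPLICIT NONE
  INTEGER MAX
  INTEGER totalPEs
  INTEGER I
  INTEGER Pstart(0:totalPEs-1)
  INTEGER Pend(0:totalPEs-1)

  DO I=0,totalPEs-1
     Pstart(I)=(I*MAX)/totalPEs+1        ! first step of block I
     Pend(I)=((I+1)*MAX)/totalPEs        ! last step of block I
  ENDDO

  RETURN
END SUBROUTINE partitiontable2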

Basic X1 Optimization Tools: pat_hwpc, loopmarks, cray_pat

Once your code is running correctly on the X1, you may want to assess its performance to determine whether it can be sped up.

The three basic X1 performance analysis tools are:

  1. pat_hwpc
  2. compiler loopmark listing
  3. cray_pat

pat_hwpc

This tool reports data from the X1 hardware performance counters. The data is collected over the entire run of the program, and thus can't point you to a specific subroutine or loop which might need attention. It has no effect on the performance of your code, and running it requires no recompilation or relinking. To use it, preface your execution command with "pat_hwpc," as follows:


  % pat_hwpc ./a.out

or, for an MPI job:

  % pat_hwpc mpirun -np 8 ./a.out 

This tool works in most cases, is trivial to use, and provides invaluable data.

The following pat_hwpc output is from an actual user application (140k lines of source). It shows that 98% of the floating-point operations are vector (this is good!), the average vector length is 44 (okay), the computational intensity is 4.5 flops per load (good), and performance is 1.4 GFLOPS (okay); the arithmetic behind these figures is spelled out after the listing. This is acceptable, but if the user were planning multiple long production runs we'd want to dig deeper for possible improvement.

Here's the pat_hwpc output:


  Exit status       0
  Host name & type  klondike crayx1 400 MHz
  Operating system  UNICOS/mp 2.3.13 12011516
  Text page size    16 Mbytes
  Other page size   16 Mbytes
  Start time        Fri Dec 19 14:46:09 2003
  End time          Fri Dec 19 14:54:28 2003
  
  Elapsed time      499.016 seconds
  User time         425.327 seconds   85%
  System time        58.844 seconds   12%
  
  Logical pe: 0  Node: 16  PID: 2414
  
  Process resource usage:
    User time    425.320431 seconds
    System time   58.765827 seconds
  
          P counter data
  CPU Seconds                              462.657200 sec
  Cycles                  1286.753M/sec  595325673264 cycles
  Instructions graduated   204.432M/sec   94582156093 instr
  Branches & Jumps           8.882M/sec    4109224077 instr
  Branches mispredicted      0.651M/sec     301007611 misses    7.325%
  Correctly predicted        8.231M/sec    3808216466 misses   92.675%
  Vector instructions       37.922M/sec   17544896116 instr    18.550%
  Scalar instructions      166.510M/sec   77037259977 instr    81.450%
  Vector ops              1680.548M/sec  777517473692 ops
  Vector FP adds           670.028M/sec  309993331718 ops
  Vector FP multiplies     679.187M/sec  314230840083 ops
  Vector FP divides etc      6.804M/sec    3147863184 ops
  Vector FP misc            11.129M/sec    5148866098 ops
  Vector FP ops           1367.148M/sec  632520901083 ops      98.244%
  Scalar FP ops             24.432M/sec   11303869190 ops       1.756%
  Total  FP ops           1391.581M/sec  643824770273 ops
  FP ops per load                               4.532 flops/load
  Scalar integer ops        26.255M/sec   12146964198 ops
  Scalar memory refs        29.890M/sec   13828976669 refs      9.735%
  Vector TLB misses          0.000M/sec          2645 misses
  Scalar TLB misses          0.000M/sec           356 misses
  Instr  TLB misses          0.000M/sec           627 misses
  Total  TLB misses          0.000M/sec          3628 misses
  Dcache references         25.188M/sec   11653277955 refs     84.267%
  Dcache bypass refs         4.703M/sec    2175698714 refs     15.733%
  Dcache misses              6.320M/sec    2923896184 misses
  Vector integer adds        4.292M/sec    1985815905 ops
  Vector logical ops         8.002M/sec    3702187534 ops
  Vector shifts              5.657M/sec    2617061278 ops
  Vector int ops            17.951M/sec    8305064717 ops
  Vector loads             233.222M/sec  107901643404 refs
  Vector stores             43.923M/sec   20321299783 refs
  Vector memory refs       277.145M/sec  128222943187 refs     90.265%
  Scalar memory refs        29.890M/sec   13828976669 refs      9.735%
  Total  memory refs       307.035M/sec  142051919856 refs
  Average vector length                        44.316 
  A-reg Instr               60.553M/sec   28015211191 instr
  Scalar FP Instr           24.432M/sec   11303869190 instr
  Syncs Instr                4.864M/sec    2250427345 instr
  Stall VLSU               665.889secs   266355799373 clks
  Stall VU                1037.923secs   415169288518 clks
  Vector Load Alloc        205.741M/sec   95187337274 refs
  Vector Load Index          1.947M/sec     900922851 refs
  Vector Load Stride        25.407M/sec   11754815962 refs
  Vector Store Alloc        43.853M/sec   20289135770 refs
  Vector Store Stride        1.356M/sec     627230085 refs
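
For reference, the derived figures quoted before the listing come straight from these counters. A throwaway sketch of the arithmetic (the constants are simply copied from the report above):


program derived_metrics
  implicit none
  real(kind=8) :: vec_fp, total_fp, total_refs, cpu_sec

  vec_fp     = 632520901083.0d0   ! Vector FP ops
  total_fp   = 643824770273.0d0   ! Total FP ops
  total_refs = 142051919856.0d0   ! Total memory refs
  cpu_sec    = 462.657200d0       ! CPU seconds

  print *, 'vector fraction of FP ops:', vec_fp/total_fp        ! ~0.982
  print *, 'FP ops per memory ref    :', total_fp/total_refs    ! ~4.53, the "FP ops per load" line
  print *, 'GFLOPS                   :', total_fp/cpu_sec/1.0d9 ! ~1.39, the "1.4 GFLOPS" quoted
end program derived_metrics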

Compiler Loopmark Listing

For Fortran codes, add "-rm" to your list of "ftn" options. E.g.:


  % ftn -O3 -rm -c mySubroutine.f

ftn will compile the source and create a listing file (giving it the ".lst" suffix), like this:

  mySubroutine.lst

The .lst file shows the source code and optimizations. A legend at the top of the file explains the various symbols.

Here's an example of what you want to see: the loops where all the work is happening are marked with "MV", meaning the compiler successfully vectorized and streamed them.


 %%%    L o o p m a r k   L e g e n d    %%%

      Primary Loop Type        Modifiers
      ------- ---- ----        ---------
      A  - Pattern matched     b - blocked
      C  - Collapsed           f - fused
      D  - Deleted             i - interchanged
      E  - Cloned              m - streamed but not partitioned
      I  - Inlined             p - conditional, partial and/or computed
      M  - Multistreamed       r - unrolled
      P  - Parallel/Tasked     s - shortloop
      V  - Vectorized          t - array syntax temp used
      W  - Unwound             w - unwound

   588.  1-----<          do ispin = 1,1
   589.  1 MV--<             do n=1,nplwv
   590.  1 MV                   rvxc(n,ispin) = rvxc(n,ispin) + real(cexf(n,ispin))
   591.  1 MV-->             enddo
   592.  1----->          enddo
   593.
   594.
   595.  1-----<       do ispin=1,1
   596.  1 MV--<          do n=1,nplwv
   597.  1 MV                xcenc = xcenc + (xcend(n,ispin)-rvxc(n,ispin))
   598.  1 MV         $           *density(n,ispin)
   599.  1 MV                ecorec = ecorec +rvxc(n,ispin)*dencore(n)/float(1)
   600.  1 MV-->          enddo
   601.  1----->       enddo

At the end of the ".lst" file you'll find messages explaining why each loop was or wasn't vectorized or streamed (for instance, a dependency on variable "X", a non-vectorizable function call, etc.). Loopmark listing is now available for C programs, too.
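
As a generic illustration (not taken from the user code above), a loop-carried dependence is the classic case such messages point to. In the sketch below, each iteration needs the previous iteration's result, so the loop cannot be vectorized as written (though some compilers may still pattern-match simple recurrences):


program recurrence_example
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real, dimension(n) :: a, b

  call random_number(b)
  a(1) = b(1)

  ! dependence on a(i-1) is carried across iterations
  do i = 2, n
     a(i) = a(i-1) + b(i)
  end do

  print *, a(n)
end program recurrence_example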

Cray_pat

Loopmark listing (above) is only really useful if you know where the code spends its time. (A loop which accounts for 0.1% of the time can be ignored, even if it doesn't vectorize.)

Profiling your code helps focus your optimization efforts. Cray_pat is like Unix prof or gprof, and will show the percentage of time spent in each subroutine or function (or loop, if needed).

Here's how to get a basic profile. First, compile your code as usual, with all desired optimizations. Then:

Step 1:

"Instrument" the executable file for profiling.

The exact object files used when the file was linked must be available in their original locations because "instrumenting" the code automatically relinks it as well. The "pat_build" tool performs the task. In this example, a.out is a pre-existing executable file, and a.out.inst will be generated:

% pat_build a.out a.out.inst

Step 2:

Run the instrumented binary exactly as you'd run the original. This will produce an experiment file (with the suffix .xf) containing output statistics for the run.

% ./a.out.inst

Step 3:

Generate a human-readable report from the .xf file using a second tool, "pat_report." E.g.:

% pat_report -i a.out.inst -o a.out.pat_report <.xf file>

Step 4:

View the report:

% more a.out.pat_report

The report will give you a table similar to this:

       %     cum.%     count   function
  --------------------------------------------------------
   100.0%   100.0%   2386567   Total
  --------------------------------------------------------
    39.8%    39.8%    950376   count_pair_position_
    17.1%    56.9%    406924   vdw_compute_insert_
     6.8%    63.7%    163462   compute_cb_environment_
     4.5%    68.2%    107056   evaluate_envpair_
     3.9%    72.1%     92691   __bcopy_prv
     3.7%    75.8%     89230   vdw_compute_reset_
     3.7%    79.5%     87765   setup_atom_type_
     2.7%    82.2%     64119   bcmp
     2.6%    84.8%     62419   evaluate_ss_
     2.0%    86.8%     48213   _F90_FCD_ASG
     1.8%    88.6%     42818   refold_coordinates_
     1.8%    90.4%     41899   _F90_FCD_CMP_EQ
     1.4%    91.8%     33712   memcmp
     0.9%    92.7%     21880   setup_allatom_list_
     0.8%    93.5%     18594   name_from_num_
     0.8%    94.3%     18371   res1_from_num_
     0.7%    95.0%     17858   __cis

Given this table, you know which loopmark listing file to examine first (in this case, the one containing the subroutine "count_pair_position").

If you need help with any of these tools, contact ARSC consulting (consult@arsc.edu). Also see our "getting started" document for the X1:

http://www.arsc.edu/support/howtos/usingx1.html

Quick-Tip Q & A



A:[[ I'm finally appreciating the benefits of the "find" command, but
  [[   here's a problem.  
  [[
  [[ When I use grep from a find command, grep doesn't tell me the names
  [[ of the files!  Sure I've got hits, but what good is it if I can't
  [[ tell what files they're in?
  [[
  [[   % find . -name "*.f" -exec grep -i flush6 {} \;
  [[                include(flush6)
  [[                   !!dvo!! include(flush6)
  [[             !!dvo!! include(flush6)
  [[                include(flush6)
  [[
  [[  Any suggestions? 



# 
# Many thanks to nine (yes, 9) responders.  There was duplication, so here
# are 5 responses which cover the range of answers.
# 

#
# John Skinner
#
You have to add an extra filename for grep. /dev/null works best:

  % find . -name "*.f" -exec grep -i "program rir" /dev/null {} \;

This is needed because grep won't list the filename of a match when
given only one file on the command line or when a wildcard like *.f only
expands to one filename. Since find's -exec option runs grep on only one
filename at a time, grep never gets two or more files on its command
line. Add /dev/null to get 2 files each time grep is run, with one of
them guaranteed to NEVER match.

You can also "turn around" the find/grep,

  % grep -i "program rir" `find . -name "*.f"`

but check this out when *.f winds up being only one filename:

  % ls *.f
    r.f

What the heck! Where's my filename, with either method??

  % grep -i "program rir" `find . -name "*.f"`
        program rir
  
  % find . -name "*.f" -print | xargs grep -i "program rir"
        program rir

Again, add an extra filename for grep:

  % grep -i "program rir" `find . -name "*.f"` /dev/null
  ./r.f:      program rir

  % find . -name "*.f" -print | xargs grep -i "program rir" /dev/null
  ./r.f:      program rir


#
# Brad Chamberlain
# 
The key is to find the flag on your grep command that prints the filename, 
since find will call grep on each file one by one.  On my desktop
systems (linux-based), it's --with-filename, so I use:

        find . -name "*.txt" -exec grep --with-filename ZPL {} \;


#
# Daniel Kidger
#
Many versions of grep (e.g., GNU) have a -H option, which prefixes the
output with the filename.  The -n option of grep is handy too - it shows
the line number within the file.  Also, I generally prefer to use 'xargs'
rather than the slightly clumsy '-exec' option of find.  (The -l option
of xargs feeds one line at a time to whatever command follows.)

Hence
$ find . -name "*.f" | xargs -l grep -inH getarg
./danmung.f:91:           call getarg(1,file_in)
./danmung.f:92:           call getarg(2,file_out)
./danfe.f:529:!     .. cf use of GETARG, & if NARG = 0.

(Note, in years gone by 'find' often needed a '-print' option in the
above.)


#
# Jed Brown
#
You are probably looking for the -H option for grep (most versions).
Otherwise, you can use:

% grep -i flush6 `find . -name "*.f"`

since usually, grep prints the name of the file if it receives several
arguments on the command line.  If this does not work or if it exceeds
the maximum number of command line arguments, you can always do
something like:

% echo 'a=$1; shift; for f in $*; do grep $a $f | sed "s|^|$f:\t|"; done' > mygrep
% chmod a+x mygrep && find . -name "*.f" -exec ./mygrep "-i flush6" {} \;


#
# Kurt Carlson
#
In ksh syntax:

find . -name "*.f" -print | while read F; do
  grep -i flush6 $F >/dev/null; if [ 0 = $? ]; then echo "# $F"; fi
done




Q: Are data written from a fortran "implied do" incompatible with a
   regular "read"?  If so, is there a way to make them compatible,
   without rewriting the code?  

   I just want to read data elements one item at a time from a
   previously written file.  Here's a test program which attempts 
   to show the problem:



iceflyer 56% cat unformatted_io.f

       program unformatted_io
       implicit none

       integer, parameter :: SZ=10000, NF=111
       real, dimension (SZ) :: z
       real :: z_item, zsum
       integer :: k

       zsum = 0.0
       do k=1,SZ
         call random_number (z(k))
         zsum = zsum + z(k)
       enddo
       print*,"SUM BEFORE: ", zsum

       open(NF,file='test.out',form='unformatted',status='new')
       write(NF) (z(k),k=1,SZ)
       close (NF)

       zsum=0.0
       print*,"SUM DURING: ", zsum

       open(NF,file='test.out',form='unformatted',status='old')
       do k=1,SZ
         read(NF) z_item
         zsum = zsum + z_item
       enddo
       close (NF)

       print*,"SUM AFTER: ", zsum
       end

iceflyer 57% xlf90 unformatted_io.f -o unformatted_io
  ** unformatted_io   === End of Compilation 1 ===
  1501-510  Compilation successful for file unformatted_io.f.
iceflyer 58% ./unformatted_io  
   SUM BEFORE:  5018.278320
   SUM DURING:  0.0000000000E+00
  1525-001 The READ statement on the file test.out cannot be completed 
  because the end of the file was reached.  The program will stop.
iceflyer 59%

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.