ARSC HPC Users' Newsletter 222, June 15, 2001

ARSC CUG Papers Available in pdf

Shared-Memory Vector Systems Compared

http://www.arsc.edu/pubs/technical/pdf/robinson.200105.pdf

Robert Bell, CSIRO, and Guy Robinson, ARSC

ABSTRACT:

The NEC SX-5 and the Cray SV1 are the only shared-memory vector computers currently being marketed. This compares with at least five models a few years ago (J90, T90, SX-4, Fujitsu and Hitachi), with IBM, Digital, Convex, CDC and others having fallen by the wayside in the early 1990s. In this presentation, some comparisons will be made between the architecture of the survivors, and some performance comparisons will be given on benchmark and applications codes, and in areas not usually presented in comparisons, e.g. file systems, network performance, gzip speeds, compilation speeds, scalability and tools and libraries.

SV1e Performance of User Codes

http://www.arsc.edu/pubs/technical/pdf/baring.200105.pdf

Tom Baring, ARSC, and Jeff McAllister, ARSC

ABSTRACT:

The first SV1e processor upgrade at a user site was accomplished at the Arctic Region Supercomputing Center (ARSC) on April 11, 2001. In general, the CPU upgrade, in advance of the "X" memory upgrade, has not improved performance as much as might be expected. We discuss performance data collected on several significant user codes as part of understanding what allows some codes to take advantage of CPU speed increase alone while others require corresponding CPU and memory rate improvements for increased performance. This upgrade underscores an important point: dissociating performance from memory speed is becoming an increasingly important part of using modern computer architectures.

FFTs and Multitasking on the SV1

[ This work was done by Tom Baring, ARSC. ]

INTRODUCTION

A number of FFT functions are available on the SV1 through libsci, as "man -k fft" shows. Several (in particular, 2D and 3D routines) are identified as "multitasked." However, an ocean modeling code used at ARSC makes extensive use of the complex-to-real routines, SCFFTM and CSFFTM, which, while they perform multiple 1D FFTs, are not documented as multi-tasked.

To explore the potential for improving multitasking performance of the user code, I implemented two additional schemes using OpenMP. This enabled me to perform timing experiments using three different approaches to doing multiple 1D FFTs. Briefly, the approaches are as follows:

  1. The user's original version, which I'll call: "SCFFTM"

    Performs FFTs on the entire collection of waves with single calls to the multiple-FFT functions, SCFFTM and CSFFTM.

  2. "Blocked SCFFTM"

    Breaks the collection of waves into groups of 8 (or any other number); calls SCFFTM/CSFFTM once per group; uses OpenMP to allow these individual FFTM calls to proceed in parallel.

  3. "SCFFT"

    Breaks the collection into single waves; calls SCFFT and CSFFT once per wave; uses OpenMP to process the FFT calls in parallel.

According to these experiments:
  • SCFFTM and CSFFTM are highly vectorized, giving excellent single-CPU performance.
  • With NCPUS=1, the SCFFTM version outperforms every other version run at any number of CPUs.
  • "Blocked SCFFTM" achieves the best performance whenever NCPUS > 1.
  • At small problem sizes, SCFFTM and CSFFTM perform poorly when NCPUS > 1.

There are a few implications for users. First, don't assume that using multiple CPUs will improve performance or reduce wallclock time; perform timing comparisons using different values of NCPUS (including 1) before embarking on a large suite of runs. Second, performance improvements can sometimes be achieved through fairly minor changes to code (or compiler options). Third, vectorization remains critical to performance on the SV1, despite the addition of cache memory.
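
On the first point, a quick way to compare NCPUS settings is to run the same binary several times and let ja, the Unicos job accounting command (it also appears in the Quick-Tip below), report on each run. This is only a sketch; the binary and output file names are made up:

  for n in 1 2 4 8
  do
    NCPUS=$n ; export NCPUS
    (ja; ./a.out; ja -cst) > timing.ncpus$n 2>&1
  done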

The rest of this article describes the multiple FFT tests and results in greater detail.

SCFFTM/SCFFT AND THE TEST CODE

For testing, the following subroutine, DERIV, was extracted from the user code and rewritten using OpenMP. The purpose of the routine, as described by the user, is to compute the derivative of the variables in Fourier space: it first takes an FFT of the variable, then multiplies by i*k (this corresponds to a derivative in real space, where k is the wave number along the direction of interest, i.e., (1:nnx)/nnx or (1:nny)/nny), and then returns to real space with an inverse FFT.

Here is the original code which, as mentioned, will be referred to as the "SCFFTM" version:


      SUBROUTINE DERIV(AA,BB,NNL,NNT,NNLP2,NNTP2,WKL)

      include 'parms.test.h'

      DIMENSION AA(NNXP2,NNY)
      DIMENSION BB(NNXP2,NNY)
      COMPLEX AAT(NC,NNY)
      DIMENSION WKL(NNX)
      COMMON/TABLES/TABLE1(100+2*NNX),TABLE2(100+2*(NNX+NNY)),
     +              WORK1((2*NNX+4)*NNY),WORK2(512*NNX)
C

      FNL=1.0/FLOAT(NNX)
      CALL SCFFTM(-1,NNX,NNY,FNL,AA,NNXP2,AAT,NC,TABLE1,WORK1,0)

C
      DO 19999 ILC=1,NC
        DO 19998 IT=1,NNY

          AAT(ILC,IT) = CMPLX(0.,WKL(ILC))*AAT(ILC,IT)

19998   CONTINUE
19999 CONTINUE

      CALL CSFFTM(1,NNX,NNY,1.0,AAT,NC,BB,NNXP2,TABLE1,WORK1,0)
C
      RETURN
      END

The NNX and NNY arguments to the multiple-FFT routines called above are critical to understanding the rest of these tests. NNX is the length of each 1D FFT performed; NNY is the "lot size," or total number of FFTs performed. All columns of the input array AA are processed in a single call to SCFFTM.

Here's the first rewrite, "Blocked SCFFTM":


      SUBROUTINE DERIV(AA,BB,NNL,NNT,NNLP2,NNTP2,WKL)

      include 'parms.test.h'

      DIMENSION AA(NNXP2,NNY)
      DIMENSION BB(NNXP2,NNY)
      COMPLEX AAT(NC,NNY)
      DIMENSION WKL(NNX)
      COMMON/TABLES/TABLE1(100+2*NNX),TABLE2(100+2*(NNX+NNY)),
     +              WORK1((2*NNX+4)*NNY),WORK2(512*NNX)

      integer, parameter :: bfct=8
      integer :: blkn, nblk, blkstart, blkend, blksz

C
      FNL=1.0/FLOAT(NNX)

!
! Use blocking to call SCFFTM on bfct-row (e.g., 8-row) chunks, and
! multitask across the blocks.
!
      nblk = (NNY + bfct - 1) / bfct   ! integer ceiling division


!$OMP PARALLEL DO default(shared),
!$OMP+            private(blkstart,blkend,blksz)

      do blkn = 1, nblk 
        blkstart = (blkn - 1) * bfct + 1
        blkend = MIN (blkstart + bfct - 1, NNY)
        blksz = blkend - blkstart + 1

        CALL SCFFTM(-1,NNX,blksz,FNL,AA(:,blkstart:blkend),
     &            NNXP2,AAT(:,blkstart:blkend),NC,TABLE1,WORK1,0)

        DO ILC=1,NC
          DO IT=blkstart,blkend
            AAT(ILC,IT) = CMPLX(0.,WKL(ILC))*AAT(ILC,IT)
          enddo
        enddo

        CALL CSFFTM(1,NNX,blksz,1.0,AAT(:,blkstart:blkend),
     &            NC,BB(:,blkstart:blkend),NNXP2,TABLE1,WORK1,0)

      enddo

      RETURN
      END

In this "Blocked SCFFTM" version, the length of each FFT remains NNX, but the number of FFTs performed per call is "blksz," which generally equals the parameter "bfct." The OpenMP PARALLEL DO directive is required to parallelize the blocked loop: even when compiled with -Otask3, the compiler would otherwise reject the loop for autotasking because it contains subroutine calls.
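
If you want to see which loops the compiler vectorizes or tasks, and why, a loopmark listing can help. This is only a sketch and rests on an assumption: -Otask3 appears above, while -rm is assumed here to ask CF90 for a listing file (deriv.f becoming deriv.lst) with loopmark annotations; check the f90 man page on your system. The file name is hypothetical:

  f90 -Otask3 -rm -c deriv.f
  grep -i task deriv.lst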

Here's the final rewrite, the "SCFFT version":


      SUBROUTINE DERIV(AA,BB,NNL,NNT,NNLP2,NNTP2,WKL)

      include 'parms.test.h'

      DIMENSION AA(NNXP2,NNY)
      DIMENSION BB(NNXP2,NNY)
      COMPLEX AAT(NC,NNY)
      DIMENSION WKL(NNX)
      COMMON/TABLES/TABLE1(100+2*NNX),TABLE2(100+2*(NNX+NNY)),
     +              WORK1((2*NNX+4)*NNY),WORK2(512*NNX)

      COMMON/TJB_TABLES/TJB_TABLE1(100+4*NNX),
     +              TJB_WORK1(4+4*NNX)


C
      FNL=1.0/FLOAT(NNX)

!$OMP parallel do default(shared)
      do irow=1,NNY
        CALL SCFFT (-1, NNX, FNL, AA(:,irow), AAT(:,irow), 
     &              TJB_TABLE1,TJB_WORK1,0)

        DO ILC=1,NC
          AAT(ILC,irow) = CMPLX(0.,WKL(ILC))*AAT(ILC,irow)
        enddo

        CALL CSFFT (1, NNX, 1.0, AAT(:,irow), BB(:,irow), 
     &              TJB_TABLE1,TJB_WORK1,0)
      enddo

      RETURN
      END

This "SCFFT" version replaces the multiple-FFT routines with a loop which manually performs FFTs, one at a time, using SCFFT/CSFFT.

To produce timings, a "driver" program is compiled and then linked with each of the three versions of "DERIV" in turn. Here's the heart of the driver:


      SVY(:,:) = ranf ()       

      tstart = second ()

      do niter = 1, 100
        CALL DERIV(SVY,SVYout,NNY,NNX,NNYP2,NNXP2,YK)
      enddo

      tend = second ()

      write (*,'("CPU time, total loop nest: ", f8.2)') tend - tstart

#ifdef DUMP
      write (*,'(8(f10.5," "))') ((SVYout(i,j),i=1,NNXP2),j=1,NNYP2)
#endif

Note: the actual driver performs the same operations on 14 arrays rather than just "SVY" (only one is shown here to save space). Thus, DERIV is called 1400 times for each timing and, of course, each call to DERIV performs NNY forward and NNY inverse FFTs of length NNX.
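
The build procedure isn't shown here; the following is only a sketch of how the driver might be compiled and linked with each version of DERIV in turn. The file and executable names are hypothetical, -Otask3 is the autotasking level mentioned earlier, and the preprocessing pass needed for the driver's #ifdef is omitted:

  f90 -Otask3 -c driver.f
  for v in scfftm blocked scfft
  do
    f90 -Otask3 -o deriv_$v driver.o deriv_$v.f
  done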

TIMINGS AND CONCLUSIONS

In the user's code, NNY = NNX = 64. Thus, these became the most important dimensions for testing. The first set of results, Figure 1, compares multitasking performance of the three versions of the subroutine for this 64x64 size:

[ Figure 1: Wallclock time vs. NCPUS for the three versions of DERIV, 64x64 case. ]

"Wallclock" time was obtained from the driver program's internal timer. These runs were made on ARSC's SV1 which currently sports 500MHz SV1e processors, during the afternoon, when the system was ranging between 80-100% busy. The blocking factor in "Blocked SCFFTM" was fixed at 8.

Given that an identical amount of work was completed at every point along the curves in Figure 1, the increase in wallclock time with increasing numbers of CPUs is disappointing: one would hope that doubling NCPUS would halve the wallclock time. It might be interesting to repeat these tests on a lightly loaded system but, as the user doesn't have that option, it wouldn't be especially helpful.

Part of the slowdown with increasing CPUs is probably due to system overhead in synchronizing, spawning, and otherwise managing multiple tasks. As shown in Figure 2, another component in the slowdown is reduced vector performance:

[ Figure 2: Vector performance (ratio of MFLOPS to MIPS) vs. NCPUS, 64x64 case. ]

In this plot, vector performance is represented by the ratio of MFLOPS to MIPS. In the best case, a single vector instruction can perform 64 floating point operations, so the y-axis could conceivably range from 0 to 64.
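
One way to collect MFLOPS, MIPS, and cache statistics for a run is hpm, the hardware performance monitor (it also appears in the Quick-Tip question below), which writes its report to stderr. A sketch, reusing one of the hypothetical executable names from above:

  NCPUS=2 ; export NCPUS
  hpm ./deriv_blocked 2> hpm.blocked.ncpus2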

The dramatic feature of this plot is the decline in SCFFTM's vector performance in going from 1 to 2 CPUs. The man page is correct in not declaring this a multitasked function. As the system parcels out bits of work to different processors, it seems to shorten vector lengths or otherwise degrade the performance of the algorithm which works so well on one processor.

Another factor reducing performance on multiple CPUs is reduced cache efficiency, as shown in Figure 3:

[ Figure 3: Cache hits per memory reference vs. NCPUS, 64x64 case. ]

This plot gives the ratio of the number of cache hits per second to memory references per second, that is, the number of cache hits per memory reference. Note that the raw cache-hit count includes instruction as well as data cache hits, which is my best explanation for the ratio exceeding 1 for the SCFFT version. The biggest hit is to the original SCFFTM version in going from 1 to 2 CPUs, but all versions suffer a decline in cache performance. Some decline in cache efficiency is expected: cache is local to each processor, but tasks are swapped in and out of processors, and possibly moved between processors, as required by the task scheduler.

The pattern of increased wallclock time with increased NCPUS seems to hold for larger problem sizes as well. Results of tests of a 256 x 256 case are shown in Figure 4:

[ Figure 4: Wallclock time vs. NCPUS for the three versions of DERIV, 256x256 case. ]

SCFFTM on 1 CPU is, again, the fastest overall, but the penalty for using it on 2 or more CPUs is less than it was in the 64x64 case. As before, the Blocked SCFFTM version is the best on multiple CPUs.

This raises another question: is there a better value for the blocking factor? Figure 5 freezes NCPUS at 4 and varies this parameter for both the 64x64 and 256x256 cases (in all of the previous plots, "bfct" was fixed at 8):

[ Figure 5: Wallclock time vs. blocking factor (bfct) at NCPUS=4, 64x64 and 256x256 cases. ]

To a degree, larger blocking factors seem to be better for larger problems. The plots aren't given here, but for the 256x256 problem, vectorization improves steadily through bfct=32, but cache efficiency starts dipping down after bfct=24.

The user of the actual code is still experimenting with the "Blocked SCFFTM" subroutine and, because the penalty for using SCFFTM on 2 or more CPUs is high, has otherwise been running on 1 CPU.

As mentioned before, the implication for all users is clear: don't just assume that your code will run faster on multiple CPUs. (ARSC users: if you want to do some timings and aren't sure how, contact consult.) Also, consider blocking algorithms, which can improve both cache and multitasking performance.

Arctic Ocean Modeling Fellow Sought

A Post Doctoral position is currently being offered at the Center for Atmosphere-Ocean Science (CAOS) at the Courant Institute of Mathematical Sciences, New York University.

A Fellow is sought to participate in the Arctic Ocean Modeling Intercomparison Project (AOMIP), a multi-institutional project supported by the International Arctic Research Center (IARC) of the University of Alaska.

Further details on the project are found at URL

http://fish.cims.nyu.edu/project_aomip/overview.html

David Michael Holland
Courant Institute of Mathematical Sciences
251 Mercer St., Warren Weaver Hall, 907
New York University, MC-0711
New York City, New York, 10012-1185 USA

Title: Assistant Professor, Mathematics
Phone: (212) 998-3245
Fax: (212) 995-4121
Email: holland@cims.nyu.edu
http://fish.cims.nyu.edu/~holland

Quick-Tip Q & A



A:[[ I share lots of big data files with my group, using regular Unix
  [[ permissions.  Most of the files have been DMF migrated off disk and
  [[ onto tape.  Quota is applied to disk files, but not migrated files.
  [[
  [[ DMF makes it really easy for one person to manage his/her files, as
  [[ the commands are easy, and storage is basically unlimited.  The
  [[ problem I have is that group members can "dmget" my files as needed
  [[ (which is what I want) but they can't "dmput" them when they're
  [[ through.  At the moment, group members are sending me email when I
  [[ need to re-migrate a file they've used, but it's inconvenient and I
  [[ keep hitting my quota unsuspectingly.
  [[
  [[ Any suggestions?



# Editor's Note: ARSC will implement a more robust version of the script
# described below, and make it accessible to all users.  We'll run a
# follow-up article as soon as the script is available.


Thanks to Jeff McAllister of ARSC:
----------------------------------

The following solution will allow selective dmputting within a group:

   1) Create a file (e.g., "dmput.lst") somewhere in your
directories.  This file will contain a list of files to be
re-migrated, and it will need group-write permission.  For obvious
reasons, write permission should be granted to the owner and group
only.  It's also important that this file NOT be in your home
directory, so that your home directory remains writable only by its
owner.

   2) Then create a short script (here I'll call it "dmputter")
containing this line:

   cat (absolute path to dmput.lst) | xargs dmput -r

Permissions on this script should allow write and execute by the
owner only.  It could reside in your home directory (and may even be
safest there, as ARSC scripts would verify that the directory has no
non-owner write permissions).  Because a modification of this script
has potentially large security implications, you must accept this
risk and ensure that world-write is never granted on this file.

   3) Placing the absolute path to dmputter in your crontab will
cause the system to run dmput, as you, on a file list editable by
members of your group.  Because it works on a short list of files, it
would not hurt to run it every hour.

    A sample crontab entry, running out of account "myaccount" on
/u1/uaf, would look like this (entered via crontab -e):

    0 * * * * /u1/uaf/myaccount/dmputter

This would execute the script on the 0th minute of every hour.

   4) After a group member unmigrates a file owned by you, and
finishes working with it, he (assuming it's a "he") would add it to
your dmput.lst file.  For example, if he had unmigrated
/u1/uaf/myaccount/somedir/x.dat and your dmput.lst file were
/u1/uaf/myaccount/dmputter/dmput.lst, he could do something like
this:

echo "/u1/uaf/myaccount/somedir/x.dat" >> /u1/uaf/myaccount/dmputter/dmput.lst

He should specify the full path to the file.  At the beginning of the
next hour, cron would run dmputter as you, and dmput the file. It would
no longer count against your /u1 quota.

   5) Because this script operates on a relatively short list of
specific files, running it every hour won't waste much time.
However, you should remove files from the list once they're
re-migrated.  Also, don't use wildcards that could expand to lots of
files: just looking up the migration status across one heavily
populated directory can easily degrade DMF performance for everyone,
even if all of the files are already migrated.  A sketch of a script
that incorporates these precautions appears below.
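
Until the more robust ARSC version mentioned above is available, here
is a sketch of how a user might harden a personal copy of "dmputter"
along the lines of these precautions.  It assumes ksh, the example
paths from step 4, and that dmput accepts file names as arguments;
treat it as illustrative only:

  #!/bin/ksh
  # dmputter: re-migrate the files listed in the group-writable
  # dmput.lst, then empty the list so the same files aren't looked
  # up again next hour.
  LIST=/u1/uaf/myaccount/dmputter/dmput.lst

  # Refuse to run if the list has somehow become world-writable.
  if [ -n "`find $LIST -perm -002 -print`" ] ; then
    echo "dmputter: $LIST is world-writable; refusing to run" >&2
    exit 1
  fi

  # Skip any entry containing a wildcard character, then dmput each
  # remaining plain file individually.
  grep -v '[*?[]' $LIST | while read f
  do
    [ -f "$f" ] && dmput -r "$f"
  done

  # Truncate the list.  (An entry appended between the loop above and
  # this line would be lost; a group member can simply re-add it.)
  : > $LIST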





Q: I'm using both ja and hpm to collect timing/performance data,
   like this:

     chilkoot$  (ja; hpm ./a.out; ja -cst) > a.out.timing

   The problem is that ja writes to stdout while hpm writes to stderr.
   Thus, the ja output is redirected successfully to a.out.timing, but
   the hpm output gets dumped to the screen.

   I'm not religious... csh, ksh, I don't care... just tell me which
   shell to use and what to do.  Thanks.

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.