ARSC T3E Users' Newsletter 194, April 28, 2000

Specifying Front-End NQS Limits Unnecessary on Yukon

There's no reason to make front-end memory and time requests in your yukon (ARSC T3E) NQS scripts. "Front-end" work is the serial work that your script does on a single "command," or "CMD," processor, as opposed to the parallel work it does on multiple "application," or "APP," processors.

Yukon jobs are given the front-end maximums for the given queue, by default. Thus, anything you might request will just hurt you. (Note that other T3E sites may handle these limits differently.) In contrast, for the parallel work, you must always specify time and number-of-processor limits.

It's generally new T3E users, migrating from the J90, who run into this problem by including (or leaving in) one or more of these options in their NQS scripts:

-lT
-lt
-lM
-lm

These four options request total time, per-process time, total memory, and per-process memory, respectively, on the serial, front-end processor. (Parallel requests are made with the -l mpp_t= and -l mpp_p= options.)
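
For example, the limit-related lines of a yukon NQS script can be as simple as this sketch (the queue name, PE count, and time are just illustrative values):


  #QSUB -q mpp
  #QSUB -l mpp_p=20         # parallel limit: number of APP PEs
  #QSUB -l mpp_t=8:00:00    # parallel limit: total MPP time
  # No -lT, -lt, -lM, or -lm lines: the queue's front-end
  # maximums apply by default.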

As an example of the problem with front-end requests, what would this command do?


  YUKON$ qsub -lM 512MB -l mpp_p=20 -l mpp_t=8:00:00 -q mpp job.script

Because of the unsatisfiable front-end request, -lM 512MB (the limit on yukon is 256MB), the command would fail. NQS's handling of this type of error is especially confusing.

The qsub would silently return, printing neither the expected NQS Identifier nor an error message. Rather, NQS would send the user an e-mail message explaining that the request was unsatisfiable. We actually ran a "quick-tip" on this condition in issue #138, if you'd like more information:

> (/arsc/support/news/t3enews/t3enews138/index.xml#qt)
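
The fix is simply to drop the front-end request and let the queue's default apply. A sketch of the same submission, minus -lM, which would be accepted:


  YUKON$ qsub -l mpp_p=20 -l mpp_t=8:00:00 -q mpp job.script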

How about this qsub script?


  #QSUB -q mpp
  #QSUB -l mpp_p=20 
  #QSUB -l mpp_t=8:00:00 
  #QSUB -lT 1:00:00 
  
  cd ~/Project/Programs 
  ja
  mpprun -n20 ./mainprog
  qalter -l mpp_p=0                    # release parallel processors
  echo | qsub -q mpp -eo -o /dev/null  # force NQS to rescan queues
  ./postprocess 
  ja -cst

Based on the number of PEs and MPP time requested, this job would run in yukon's "medium" queue, which allows, by default, 8 hours of front-end time. If the job did more than an hour of pre-processing work (like compiling) plus post-processing work, however, it would be killed because of the unnecessary limit, -lT 1:00:00.

(Before it does any single-PE post-processing, this job does the right thing by releasing its 20 APP PEs and forcing NQS to rescan the queues.)

Workshop on OpenMP Applications and Tools


>                           WOMPAT 2000
>             Workshop on OpenMP Applications and Tools
>      San Diego SuperComputer Center, San Diego, California
>                        July 6th-7th 2000
> 
> 
> http://www.cs.uh.edu/wompat2000/

> 
> ************************************************************
> OpenMP has recently emerged as the definitive standard for
> shared memory parallel programming. For the first time, it
> is possible to write parallel programs which are portable
> across the majority of shared memory parallel computers.
> 
> The Workshop on OpenMP Applications and Tools (WOMPAT 2000)
> will be held on July 6 and 7 at the San Diego Supercomputer
> Center, San Diego, California. It is dedicated to the OpenMP
> language and its use. WOMPAT provides an opportunity for users
> and developers of OpenMP to meet, share ideas and experiences,
> and to discuss the latest developments.
> 
> 
> DEADLINES FOR ABSTRACT SUBMISSION AND REGISTRATION
>    Submission of abstracts:               5 May 2000
>    Notification of acceptance:           29 May 2000
>    Early registration deadline:          12 June 2000
>   
> 
> RELATED MEETINGS:
>   Note that there will also be a workshop focussed on OpenMP and
>   applications in Europe on September 14-15.  See
>   http://www.epcc.ed.ac.uk/ewomp2000/ for details.
> 
>   A Workshop on OpenMP Experiences and Implementations (WOMPEI) will be
>   held in Tokyo on October 16 - 18.
> 

Quick-Tip Q & A



A:{{ When should I use SHMEM_FENCE versus SHMEM_QUIET?  Is there
  {{ any difference between them?



  # Thanks to Etienne Gondet of IDRIS for this explanation:
  
  First of all, the routing algorithm on the T3E crossbar can be
  adaptive.  That means that, in case of a bottleneck on the crossbar,
  the algorithm can choose another route (less direct, but not
  overloaded) over the links between the PEs (processing elements).

  The problem, for an asynchronous shmem routine like shmem_put, is
  that the arrival order of several calls may not be the same as their
  lexical order.

  So shmem_fence assures that a shmem_put before the fence will arrive
  before a shmem_put lexically after the fence.

  Shmem_quiet is like a local barrier: it assures that every shmem_put
  lexically before the shmem_quiet has arrived before the local PE (the
  executing one) continues execution past the shmem_quiet.


  # Other observations, by the editors:

  In C, these routines are macros and, by default, expand to the same
  function. This fragment,

        #include <mpp/shmem.h>
        
        void testme() {
          shmem_fence();
          shmem_quiet();
        }

  expands to this:

        void testme() {
          (_remote_write_barrier());
          _remote_write_barrier();
        }


  To get the different functions, undefine the macros (as shown in the
  man pages). This fragment,

        #include <mpp/shmem.h>
        #undef shmem_quiet
        #undef shmem_fence

        void testme() {
          shmem_fence();
          shmem_quiet();
        }

  expands to this:

        void testme() {
          shmem_fence();
          shmem_quiet();
        }

  So, which should you use?  This advice is from "man shmem_fence":

     NOTES
        The shmem_quiet function should be called if ordering of puts
        is desired when multiple remote PEs are involved.

  If ordering of PUTs is required by your algorithm, you should always
  use SHMEM_QUIET/SHMEM_FENCE.  Even if your T3E (like ARSC's "yukon")
  has adaptive routing turned off, it might one day be turned on, your
  code might land on another host, and the appearance of
  SHMEM_QUIET/SHMEM_FENCE self-documents the PUT dependencies nicely.
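
  To make the distinction concrete, here's a minimal sketch of the
  common put-then-notify pattern using SHMEM_FENCE.  (This is our
  illustration, not taken from the man pages; it assumes the T3E's
  word-oriented shmem_put and shmem_wait calls and the _my_pe
  intrinsic, and would be run on two or more PEs, e.g., with
  "mpprun -n2".)

        #include <stdio.h>
        #include <mpp/shmem.h>

        /* File-scope variables are symmetric: they exist at the same */
        /* address on every PE and may be the targets of puts.        */
        long data = 0;
        long flag = 0;

        main()
        {
            if (_my_pe() == 0) {
                long val = 42, one = 1;

                shmem_put(&data, &val, 1, 1);  /* send the value to PE 1 */
                shmem_fence();                 /* order the two puts     */
                shmem_put(&flag, &one, 1, 1);  /* then raise PE 1's flag */
            }
            else if (_my_pe() == 1) {
                shmem_wait(&flag, 0);          /* spin until flag != 0   */
                printf("PE 1 sees data = %ld\n", data);
            }
            return 0;
        }

  Because of the fence, the flag cannot arrive at PE 1 before the data
  does, even with adaptive routing.  If PE 0 instead scattered data to
  several different PEs before raising a single flag, the man page note
  above says SHMEM_QUIET, not SHMEM_FENCE, is the call to use before
  the flag put.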




Q: In a similar vein, shouldn't I use SHMEM_BROADCAST instead of
   MPI_Bcast?  Don't they do the same thing?  And isn't SHMEM always
   fastest?

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.