ARSC T3E Users' Newsletter 176, August 27, 1999

Chaining NQS Jobs

During the life of the ARSC T3D and T3E, we've evolved policies governing the submission of NQS jobs. Simply stated, you can't have more than one job per queue at a time. This policy seems to be well-received, and, we hope, ensures fair access. (See "news queue_policy" on yukon for details.)

The policy creates an inconvenience, though, for users who know in advance that they want to run more than one job (for instance, over the weekend).

Thus, we permit one NQS job to submit its own successor, as long as jobs in all other queues, including low-priority queues, have a chance to start before the successor job starts. In other words, one job may submit another, but all jobs must start at the back of the line.

There are various ways to implement this, but we strongly recommend "chained" as opposed to recursive (or "self-submitting") qsub scripts. We've seen SEVERAL examples of infinite recursion in scripts gone bad.
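
For contrast, here's a minimal sketch (the script and file names are
hypothetical) of the self-submitting pattern we discourage:


  recursive_job.script             (DON'T do this)
  ------------------
  [... normal #QSUB options, mpprun parallel apps, etc. ...]

  # The script resubmits ITSELF.  If this test ever misfires -- say,
  # the flag file is left behind by a run that died early -- the job
  # recurses indefinitely:
  if [ -f "$HOME/keep_running" ] ; then
    qsub recursive_job.script
  fi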

As described in the "quick-tip" of newsletter #150, here's what we mean by "chaining":


  job_A.script  does its normal work, then submits   job_B.script
  job_B.script    "   "     "     "     "    "       job_C.script
  job_C.script    "   "     "     "     "    "       job_D.script
  job_D.script    "   "     "     " and terminates.

Next question: If these are running in a high-priority queue, how can they yield to other users between jobs, as required?

Our recommended solution is for the NQS script to:

  1. execute "qalter -l mpp_p=0" to release its APP PEs,
  2. submit a null job to trigger an NQS rescan of the queues,
  3. sleep while NQS starts other work,
  4. submit the next job in the chain.

For instance:


  job_A.script
  ------------------
  [... normal #QSUB options, mpprun parallel apps, etc. ...]
  
  # This appears at the very end of the script
  qalter -l mpp_p=0                # Release this job's PEs
  qsub do_nothing.script           # Force NQS to rescan the queues
  sleep 20                         # Give NQS time to start other jobs
  qsub job_B.script                # Submit next job in chain 

Here's the null job:

  do_nothing.script
  ------------------
  #QSUB -q mpp                     # This will run in the single queue

  echo "This script did nothing, and ran at: "  
  date

Please tell us your stories about using NQS scripts.

Post-Processing Standard Output Files From NQS Jobs

We got an interesting question last week.

A yukon user had two chained jobs (see previous article), but before the second job started, he wanted to extract and reformat some information from the standard output file generated by the first job.

For the uninitiated, when you submit an NQS job, for instance:


  YUKON$ qsub myjob
    Request <34488.yukon>: Submitted to queue <mpp> by <myname(3325)>.
  YUKON$ 

NQS assigns it an NQS-ID number (34488 in this case). When the job runs (in the middle of the night, perhaps), all the information that would normally go to your terminal is stored in temporary files. When the job terminates, these files are moved to the directory from which you submitted the job. They are given names with the extensions,

  .o<NQS-ID>  and 
  .e<NQS-ID>

for standard output and standard error, respectively. E.g.:


  myjob.o34488
  myjob.e34488

Our user's first thought was to start the post-processing script from within the same NQS request that was generating the standard output. By launching the script with "nohup" and putting it in the background using "&", e.g.,


  nohup postprocessor &

he thought that the NQS request might complete normally and copy over the .o file, while the post-processor would continue to run in the background. Then, after a sleep command, the post-processor would have access to the .o file.

This solution wouldn't work, though, because even if a command runs in the background with nohup, the NQS parent request won't terminate until all of its children exit.
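
To make the failure concrete, here's a sketch (file names are
hypothetical) of the tail of such a script:


  # At the very end of the NQS script -- this does NOT work:
  nohup ./postprocessor > post.log 2>&1 &  # post-processor in background

  # The script ends here, but the request doesn't: NQS waits for the
  # nohup'd child to exit, so the .o file is never moved into the
  # submission directory while the post-processor sits there sleeping,
  # waiting for it.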

Another possibility was to insert the post-processor into the chain as its own intermediate NQS request.
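
That might look like the following sketch (script and file names are
hypothetical). If job_A.script names its own standard output file with
qsub's "-o" option (avoiding the unpredictable .o<NQS-ID> name) and
ends with "qsub postprocess.script", the intermediate request could be:


  postprocess.script
  ------------------
  #QSUB -q mpp                     # Command-only job; runs on a CMD PE

  cd ~/progs                       # cd to working directory
  ./postprocessor < job_A.output   # Digest job_A's standard output
  sleep 20                         # Give NQS time to start other jobs
  qsub job_B.script                # Continue the chain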

Our user ultimately chose a simpler approach. He didn't actually need everything that was being written to standard output, just the standard output from one particular application. Thus, he could redirect the standard output from that particular application to a separate file, e.g.,


  mpprun -n 50 ./a.out > a.out.stdout

run the post-processor further down in the same NQS script, and chain the next job.

Here's a sample of an NQS script to do all this (and to "chain" safely, as described in the last article):


  ##################################################################
  #QSUB -q mpp
  #QSUB -l mpp_p=60
  #QSUB -l mpp_t=7:00:00
  
  cd ~/progs           # cd to working directory

  # Run main application, save standard output to file
  mpprun -n60 ./myprog > myprog.stdout    


  qalter -l mpp_p=0                    # Release all 60 APP PEs

  qsub do_nothing.script               # Force NQS to rescan the queues

  ./postprocessor < myprog.stdout      # Run post-processor on a CMD PE

  sleep 20                             # Give NQS time to start other jobs

  qsub next_job_in_chain.qsub          # Submit next job
  ##################################################################

CUG T3E Workshop Preliminary Technical Program

For more on the upcoming T3E conference, see:

http://www.fpes.com/cugf99/

The following preliminary program is from:

http://www.fpes.com/cugf99/pages/prelmprg.htm

Cray User Group T3E Workshop October 7-8, 1999 Princeton, New Jersey

Preliminary Program


Wednesday October 6
   8:00-10:00 PM Opening Reception at Nassau Inn (sponsored by SGI)

Thursday October 7
   7:00 Continental Breakfast (provided)
   8:30 Welcome, Sally Haerer, CUG President, NCAR and
        Helene Kulsrud, Workshop Chair, CCR-P
   8:45 The New Corporate Viewpoint, Steve Oberlin, SGI
   9:15 Software Overview, Mike Booth, SGI
  10:00 Break
  10:30 T3E Performance at NASA Goddard, Thomas Clune and
        Spencer Swift, NASA/SGI
  11:00 Performance on the T3E, Jeff Brooks, SGI
  11:30 Improving Performance, Mike Merrill, DOD
  12:00 Lunch (provided)
   1:30 Performance Evaluation of the Cray T3E for AEGIS Signal
        Processing, James Lebak, MIT Lincoln Laboratory
   2:00 Approaches to the parallelization of a Spectral
        Atmosphere Model, V. Balaji, GFDL/SGI
   2:30 Getting a Thumb-Lock on AIDS: T3E Simulations Show
        Joint-Like Motion in Key HIV Protein, Marcela Madrid, PSC
   3:00 Break
   3:30 Cray T3E Update and Future Products, William White, SGI
   4:30 Discussion and Q & A
   7:00 Dinner at Prospect House, Princeton University
        (provided by CUG)

Friday October 8
   7:00 Continental Breakfast (provided)
   8:00 Tutorial on Co-arrays, Robert Numerich, SGI
   9:00 Introduction to UPC, William Carlson & Jesse Draper, CCS
   9:30 Break
  10:00 HPF and the T3E: Applications in Geophysical
        Processing, Douglas Miles, PGI
  10:30 First Principles Simulation of Complex Magnetic
        Systems: Beyond One Teraflop, Wang, PSC; Ujfalussy,
        Wang, Nicholson, Shelton, Stocks, Oak Ridge National
        Laboratory; Canning, NERSC; Gyorffy, Univ. of Bristol
  11:00 Massively Parallel Simulations of Plasma Turbulence,
        Zhihong Lin, Princeton University
  11:30 Early Stages of Folding of Small Proteins Observed in
        Microsecond-Scale Molecular Dynamics Simulations, Peter
        Kollman & Yong Duan, University of California, San
        Francisco
  12:00 Lunch (provided)
   1:30 T3E Scheduler and OS Configuration Experiences on Large
        Systems, Dave Poulin, DOD/SGI
   1:40 Update on NERSC PScheD Experiences, Michael Welcome, NERSC
   2:05 Programming Tools on the T3E: ARSC Real Experiences and
        Benefits, Guy Robinson, ARSC
   2:35 Bad I/O and Approaches to Fixing It, Jonathan Carter, NERSC
   3:05 Break
   3:30 Running the UK Met. Office's Unified Model on the Cray
        T3E, Paul Burton, UKMET
   4:00 Achieving Maximum Disk Performance on Large T3E's:
        Hardware, Software, and Users, John Urbanic and Chad
        Vizino, PSC
   4:30 I/O and Filesystem Balance, Tina Butler, NERSC
   5:00 Wrapup

PGHPF Lecture at ARSC, Sept 24

ARSC has scheduled a lecture by Doug Miles of The Portland Group (PGI) for Friday, September 24 at 10:00 in the Sherman Carter conference room (Butrovich 204).

Here's Doug's abstract:

PGHPF High Performance Fortran: Successful applications and optimal programming techniques on shared- and distributed-memory systems

First introduced in 1993, the HPF programming language and HPF compilers have matured considerably. Early compiler implementations were often incomplete and inefficient. They also suffered from undue expectations on the part of users wanting transparent migration of existing serial Fortran applications to distributed-memory systems. Despite the difficult startup process, the runaway success of MPI as a low-level programming model, and the advent of OpenMP for shared-memory systems, the HPF standard has stood the test of time and retains its attraction as the only high-level Fortran programming model for both shared- and distributed-memory systems.

In addition to the gradual tuning and increasing functionality of compilers like PGHPF, effective use of HPF has been aided in part by a shift in the perceived role of the language. An HPF compiler is not a completely automated means of parallelizing applications. It is a tool that eliminates most or all of the low-level coding details required by MPI or other explicit parallel programming models. In particular, an HPF programmer still must understand the application, develop a parallelization strategy, and think carefully about its implementation. HPF then often (but not always!) allows a very concise and straightforward means to express the resulting algorithm, with a several-fold reduction in programming effort over a message-passing equivalent.

This talk will include brief overviews of several existing production HPF applications, and an overview of the current capabilities, strengths, and weaknesses of the PGHPF compiler. Optimal HPF programming will be discussed, with a view toward practical advice on efficient language features, appropriate compiler options, and how to approach difficult problems such as irregular data accesses and parallel I/O.

To read the man pages and use pghpf on yukon, load the pghpf module:

yukon> module load pghpf
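
Once the module is loaded, compiling and running an HPF code should follow the usual compiler-driver pattern. The following is a sketch only; the file name is made up, and the right flags for yukon are in "man pghpf":

yukon> pghpf -o myprog myprog.f90
yukon> mpprun -n 8 ./myprog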

ARSC: Access Labs Open House

ARSC has opened a new Access Lab in the Natural Sciences Building at UAF. The following labs are now available to ARSC users, and are supplied with SGI Octane, O2, and Onyx2 workstations:
  • Natural Sciences Building, Room 161
  • Duckering Building, Room 234
  • Elvey Building (Geophysical Institute), Room 221
  • Butrovich Building, Room 007
We've been having "Open Houses" in the labs, and two remain:
  • Thurs, Sept 2, 10am-12pm, 221 Elvey
  • Fri, Sept 3, 10am-12pm, 007 Butrovich
Please drop by to see demos, chat with staff, etc. For details, see:

http://www.arsc.edu/pubs/bulletins/VisLabOpenHouse.shtml

ARSC: Fall Course Schedule

Here's ARSC's 1999 fall course schedule:
  • Sept 15: ARSC Extravaganza for New and Prospective Users
  • Sept 22: Introduction to Unix
  • Sept 29: T3E Basics and Parallel Programming Flyby
  • Oct 13: Data Visualization Possibilities
  • Dec 8: Parallel Computing, Real Applications and Examples
The courses are all on Wednesdays at 2pm. For details, see:

http://www.arsc.edu/user/Classes.html

Quick-Tip Q & A



A:{{ Does any shell available under UNICOS/mk offer file name 
  {{ completion, as tcsh does?



# Thanks to two readers:
##################################################################

Both ksh and sh offer file name completion; try, for instance:

        $ touch foobar
        $ set -o vi
        $ ls fo<ESC>\

and you will see the shell completes to foobar (assuming this is
unique).  There is also an emacs completion mode for users not familiar
with vi.

However, both tcsh and bash are publicly available for U/mk, so there
should be no need for users to learn anything new regarding filename
completion. :-) We have been using both for years on our T3E system.

##################################################################

In csh:    set filec

Simple as that!  Then (from the man page):

   filec  Enable filename completion, in which case the EOT
          character ( <Ctrl-d> ) and the ESC character have special
          significance when typed in at the end of a terminal 
          input line:

             EOT  Print a list of all filenames that start with the
                  preceding string.

             ESC  Replace the preceding string with the longest unambiguous
                  extension.

##################################################################


  Editor's Note #1: Unfortunately, csh in UNICOS and UNICOS/mk doesn't 
have filename completion.  The above csh solution is correct, however,
and works fine under IRIX.

  Editor's Note #2:  ARSC, like SGI/Cray, has decided not to support 
bash or tcsh.  




Q: I moved some files from my home directory to my /tmp directory,
   double checked them with "ls," but the next day they were gone,
   gone, gone.  I didn't delete them and I know that the purger on /tmp
   only removes files over 10 days old.  What happened?

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.