ARSC HPC Users' Newsletter 319, July 1, 2005
Creating Sequences of Batch Jobs in PBS: Part I of II
Many users of HPC systems require multiple long runs to complete a single simulation or experiment, and, often, the separate runs must be processed in sequence.
One method for creating a sequence of batch jobs is what ARSC has in the past termed, "chaining." (See issues: 176, 259, and 297.) To create a "chain" of jobs, each batch script, as its final act before terminating, executes the "qsub" or "llsubmit" command to submit its successor. We strongly discourage recursive, or "self-submitting," scripts because we've seen them go awry too many times and flood the system. Instead we recommend a simple, finite chain in which "A" submits "B," "B" submits "C", and "C" stops. (This could go up to "Z" or even "ZZ," of course.)
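As a minimal sketch of the finite chain described above (not a definitive implementation), each script could end by calling a small helper like the one below. The function name "submit_successor" and the script names "B.pbs" and "C.pbs" are hypothetical, chosen only for illustration; on a LoadLeveler system, "llsubmit" would take the place of "qsub."

```shell
#!/bin/sh
# submit_successor NEXT_SCRIPT   (hypothetical helper, for illustration)
# Submit the named successor script. If the name is empty, do nothing:
# the last link in the chain passes an empty name, so the chain ends.
submit_successor() {
    next_script=$1
    if [ -z "$next_script" ]; then
        echo "end of chain: nothing to submit"
        return 0
    fi
    if [ ! -f "$next_script" ]; then
        echo "successor script not found: $next_script" >&2
        return 1
    fi
    # The successor is named explicitly; the script never resubmits itself.
    qsub "$next_script"
}
```

Script "A" would end with `submit_successor B.pbs`, script "B" with `submit_successor C.pbs`, and script "C" with `submit_successor ""`. Because each successor is named explicitly, the chain cannot loop back on itself.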
For some jobs, chaining isn't an option. It's an unpleasant way to end a run, but codes which write their own frequent restart files are sometimes allowed to simply run out of time. The batch system kills them when they hit the time limit, and the user submits a follow-on job which picks up again at the most recent restart file. Chaining won't work for such jobs because the batch script is halted when the application code is killed. Thus, subsequent script commands, like the "qsub" or "llsubmit" which might create the chain are not processed.
Fortunately, LoadLeveler and PBS both allow users to move the logic for chaining from the script and into the scheduler. The LoadLeveler feature was discussed in the article "Using LoadLeveler Job Steps" in issue #307: /arsc/support/news/hpcnews/hpcnews307/index.xml
In PBS on the X1, you use the "qsub -W depend=..." option to create dependencies between jobs.
The three most useful types of supported dependencies are probably "afterany," "afterok," and "afternotok." These are used as follows, where "<JOB-ID>" is the PBS ID number of a previously submitted job, and "<QSUB SCRIPT>" is a regular qsub script:
    qsub -W depend=afterany:<JOB-ID> <QSUB SCRIPT>
    qsub -W depend=afterok:<JOB-ID> <QSUB SCRIPT>
    qsub -W depend=afternotok:<JOB-ID> <QSUB SCRIPT>
From "man qsub," here's the description of these attributes:
    afterany:jobid[:jobid...]
        This job may be scheduled for execution after jobs jobid have
        terminated, with or without errors.

    afterok:jobid[:jobid...]
        This job may be scheduled for execution only after jobs jobid
        have terminated with no errors. See the csh warning under
        "Extended Description".

    afternotok:jobid[:jobid...]
        This job may be scheduled for execution only after jobs jobid
        have terminated with errors. See the csh warning under
        "Extended Description".
From these descriptions, it's obvious that you can use the error condition of the predecessor job to either halt or perpetuate the sequencing of jobs, as needed. In the scenario described earlier, it is planned that jobs will run into the time limit and be killed (which creates an error condition). Presumably, if a job completed without error, it would signify the clean end of the entire sequence of runs (e.g., the solution converged, the final timestep was processed, etc.). Thus the goal would be to continue the sequence on error but halt it if there's no error.
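The "continue on error" scenario above might be set up along the following lines. This is only a sketch under stated assumptions: the function name "submit_sequence" and the script name "run.pbs" are hypothetical, and it relies on "qsub" printing the new job's ID on standard output so that the ID can be captured for the next submission.

```shell
#!/bin/sh
# submit_sequence SCRIPT COUNT   (hypothetical helper, for illustration)
# Submit SCRIPT COUNT times. Each job after the first depends on its
# predecessor via "afternotok", so it may be scheduled only after the
# predecessor has terminated with errors (e.g., killed at the time limit).
submit_sequence() {
    script=$1
    count=$2
    # Capture the job ID that qsub prints on standard output.
    prev_id=$(qsub "$script") || return 1
    echo "$prev_id"
    i=2
    while [ "$i" -le "$count" ]; do
        prev_id=$(qsub -W depend=afternotok:"$prev_id" "$script") || return 1
        echo "$prev_id"
        i=$((i + 1))
    done
}
```

For example, `submit_sequence run.pbs 3` would submit three jobs, the second dependent on the first and the third on the second, and print the three job IDs. The sequence is finite by construction, since the number of jobs is fixed when it is submitted.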
In the next issue, I'll give more details and an example.
ARSC is currently deploying checkpointing functionality on our two IBM systems: iceberg and iceflyer. Checkpointing allows a running program to be saved to a file. At a later time the program can be restarted from the previous point of execution using the checkpoint file. Checkpointing should allow for increased utilization of the systems prior to downtimes and benefit long jobs which have no built-in checkpointing facilities.
The LoadLeveler keyword 'checkpoint' specifies whether or not a job should be considered for checkpointing.
    # The following specifies that a job can be checkpointed.
    # @ checkpoint = yes
Unlike standard LoadLeveler scripts, scripts for jobs with checkpointing enabled must be executable. Scripts that request checkpointing but are not executable will be rejected by LoadLeveler.
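One way to catch this before submitting is a small pre-check like the sketch below. The function name "check_checkpoint_script" is hypothetical, and the check assumes the directive is written in the form shown above ("# @ checkpoint = yes"); it is not part of LoadLeveler itself.

```shell
#!/bin/sh
# check_checkpoint_script SCRIPT   (hypothetical helper, for illustration)
# Return 0 if SCRIPT either does not request checkpointing or is
# executable; return 1 if it requests checkpointing but is not executable.
check_checkpoint_script() {
    script=$1
    # Look for a directive of the form "# @ checkpoint = yes".
    if grep -Eiq '^#[[:space:]]*@[[:space:]]*checkpoint[[:space:]]*=[[:space:]]*yes' "$script"; then
        if [ ! -x "$script" ]; then
            echo "$script requests checkpointing but is not executable; run: chmod +x $script" >&2
            return 1
        fi
    fi
    return 0
}
```

A typical use would be `check_checkpoint_script myjob.ll && llsubmit myjob.ll`, where "myjob.ll" is a placeholder script name.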
Below are a few limitations to checkpointing which may apply in particular to codes running on ARSC IBM systems.
- MPI programs must be compiled with the reentrant version of the compiler (e.g. mpcc_r, mpCC_r, mpxlf_r, mpxlf90_r, etc.)
- Any regular file that is open when the program is checkpointed must be available when the job is restarted. In particular, programs using on-node scratch (e.g. /scratch) will most likely fail to restart because the job may not be located on the same node when restarted. The $WRKDIR and $HOME filesystems are accessible from all nodes and should not restrict restart.
- Processes using sockets, pipes or user shared memory have restrictions (see link below).
- There are a number of restrictions related to the use of pthread locks (see link below).
- See the Loadleveler documentation for a complete list of checkpointing limitations:
We are looking for several volunteers to help test the checkpointing functionality. Please contact the ARSC help desk firstname.lastname@example.org for more details. Projects with little or no remaining allocation are especially encouraged to inquire.
Cray XD1: Nelchina
In May, ARSC began installing a Cray XD1 system. The system, named Nelchina, is currently being configured for use as an academic resource and should be available by the end of the year. Nelchina consists of 3 chassis with 6 nodes in each. Each node has two 2.4 GHz Opteron 250 processors and 4 GB of RAM. Additionally, one chassis has 6 field programmable gate arrays (FPGAs).
This summer several Ph.D. candidates from George Washington University are visiting the Arctic Region Supercomputing Center to investigate the FPGA technology on the Cray XD1. From their experiences we hope to get a better understanding of the problems that the system is best suited to solve.
Quick-Tip Q & A
A:[[ Sometimes I'll run an X1 PBS script interactively instead of
  [[ through "qsub," to test the basic syntax of the shell script.
  [[ Here's a sample script (with just the basics remaining):
  [[
  [[   #PBS -l walltime=4:00:00
  [[   #PBS -l mppe=8
  [[   #PBS -q default
  [[
  [[   cd $PBS_O_WORKDIR
  [[   aprun -n 8 ./a.out
  [[
  [[ I'm totally annoyed, though, because I usually forget that in an
  [[ interactive run, the PBS variable PBS_O_WORKDIR doesn't get set!
  [[ So, when the script hits this line:
  [[
  [[   cd $PBS_O_WORKDIR
  [[
  [[ it cd's my session to my home directory and everything fails until
  [[ I remember to go back and comment out the "cd $PBS_O_WORKDIR".
  [[ Then, of course, when I'm done with interactive tests and submit
  [[ the real, batch, run, I forget to UNcomment the "cd", and everything
  [[ fails again!
  [[
  [[ Any ideas to help me out?

  #
  # Martin Luthi
  #
  A very simple way to do this would be to test for the program name,
  available in the script as $0. The exact syntax depends on the shell
  language, but here is an example for Bourne-Shell:

  ======================
  #!/bin/sh
  if [[ $0 != "qsub" ]]
  then
    cd $PBS_O_WORKDIR
  fi
  ======================

  #
  # Lee Higbie
  #
  In the script you use to enter your interactive session, or as soon
  as you enter it, type:

    export PBS_O_WORKDIR=`pwd`

  depending on the system and shell, you may have to type setenv
  instead of export or separate setting the variable and exporting it
  onto two lines.

  #
  # Ed Kornkven
  #
  We can test $PBS_O_WORKDIR to see if it is a non-empty string. If it
  is, we assume that it contains the directory that we want to change
  to. If $PBS_O_WORKDIR is not set then the test fails and the "cd" is
  not executed.

  Using the usual block-if, we write:

    if [[ -n "$PBS_O_WORKDIR" ]] ; then
      cd $PBS_O_WORKDIR
    fi

  Alternatively, in the Korn shell (and probably others), one can put
  the command in a list construct where the first list item is the
  test which, if successful, allows the command to execute:

    [[ -n "$PBS_O_WORKDIR" ]] && cd $PBS_O_WORKDIR

Q: Here's a question from former editor, Guy Robinson:

   Read any good computation/parallel programing/science books
   recently? If so, send title and short review.
[[ Answers, Questions, and Tips Graciously Accepted ]]
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.