ARSC HPC Users' Newsletter 269, May 23, 2003
Cray X1 Installed at ARSC
Yesterday afternoon, ARSC's two-cabinet Cray X1 was delivered. The X1 is the latest offering from Cray, Inc. The system should be available for allocated usage by Oct. 1. Here's an image of it in our machine room:
The system is named "klondike". It took 70 hours in a semi-tractor trailer rig to cover the 3400 miles from Chippewa Falls, Wisconsin to Fairbanks, Alaska, and 6 hours to be unloaded, set up, and powered up in our machine room.
Processors are being installed in two stages. Currently, it has 64 multi-streaming processors, each rated at 12.8 gflops, and the second set of 64 will be installed later this summer.
Here's a 10-minute movie, courtesy of Leone Thierman of ARSC, of unloading and installation:
Stay tuned! We'll have a lot to say about the X1 in this newsletter.
SX-6 Memory Requirements for Parallel Jobs
[ Thanks to Ed Kornkven of ARSC. ]
SX-6 users may be familiar with the steps necessary to run a program under autotasking. Namely, the program must first be compiled with the -Pauto compiler option. Second, the NQS script for an autotasked batch job must specify the number of processors for NQS to allocate to the job (the -c option) as well as the number of microtasks to use in parallelizing the program (via the F_RSVTASK environment variable). These two numbers will typically be equal. Here is a simple NQS script for running "a.out" with two processors:
  # # # # # # # # # # # #
  # Per-request CPU time
  #@$-lT 20:00
  #
  # Per-request Memory
  #@$-lM 64MB
  #
  # Number of CPUs to use
  #@$-c 2
  #
  # Combine stderr and stdout into one file
  #@$-eo
  #
  # Name of queue
  #@$-q batch
  #
  # Shell to use
  #@$-s /usr/bin/csh
  #
  # Change to work directory
  cd $QSUB_WORKDIR
  #
  # Detailed hardware stats
  setenv F_PROGINF DETAIL2
  #
  # Number of microtasks
  setenv F_RSVTASK 2
  #
  # Execute the program
  ./a.out
  # # # # # # # # # # # #
It isn't our purpose to give an in-depth NQS tutorial here, but rather to illustrate a potential pitfall and its obscure manifestation. When this script was executed recently, it seemed to only run on one processor as evidenced by the hardware statistics that were printed. A portion of those are reproduced here:
       ****** Program Information ******
  Real Time (sec)        : 12.620525
  User Time (sec)        : 11.534289
  Sys Time (sec)         : 0.032625
  Vector Time (sec)      : 11.463276
  Inst. Count            : 917380607.
  V. Inst. Count         : 874561395.
  V. Element Count       : 104537180864.
  FLOP Count             : 82143494463.
  MOPS                   : 9066.878771
  MFLOPS                 : 7121.678195
  MOPS (concurrent)      : 9066.903926
  MFLOPS (concurrent)    : 7121.697953
  VLEN                   : 119.530980
  V. Op. Ratio (%)       : 99.959056
  Memory Size (MB)       : 48.000000
  Max Concurrent Proc.   : 1.
  Conc. Time(>= 1)(sec)  : 11.534257
           :
           :
Note the last two lines. Only one processor is reporting. We were expecting to see something like:
       ****** Program Information ******
  Real Time (sec)        : 6.091186
  User Time (sec)        : 12.125577
  Sys Time (sec)         : 0.028495
  Vector Time (sec)      : 11.468935
  Inst. Count            : 937278529.
  V. Inst. Count         : 874561403.
  V. Element Count       : 104537182776.
  FLOP Count             : 82143574467.
  MOPS                   : 8626.385359
  MFLOPS                 : 6774.405413
  MOPS (concurrent)      : 17179.574337
  MFLOPS (concurrent)    : 13491.328818
  VLEN                   : 119.530981
  V. Op. Ratio (%)       : 99.940041
  Memory Size (MB)       : 80.000000
  Max Concurrent Proc.   : 2.
  Conc. Time(>= 1)(sec)  : 6.088620
  Conc. Time(>= 2)(sec)  : 6.037041
           :
           :
In the second report we see clearly that two processors are at work and in fact giving a very nice speedup.
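To put a number on "very nice", the two "Real Time (sec)" values can be divided. The following is only a small sketch in POSIX sh; the function name "speedup" is made up for illustration and is not part of any SX-6 tool.

```shell
#!/bin/sh
# speedup (hypothetical helper): ratio of two wall-clock times.
#   $1 = "Real Time (sec)" of the one-processor run
#   $2 = "Real Time (sec)" of the multi-processor run
speedup() {
    awk -v t1="$1" -v tn="$2" 'BEGIN { printf "%.2f\n", t1 / tn }'
}

# Real Time values taken from the two reports above.
speedup 12.620525 6.091186
```

For these two runs the ratio comes out at about 2.07 on two processors. A single pair of timings is a small sample, so treat the figure as indicative rather than exact.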
What happened in the first run? The hint is the "Memory Size (MB)" line in each of the hardware stat displays: 48MB in the first run versus 80MB in the second. The first program ran with insufficient memory for autotasking, and the only sign of that fact is the odd hardware statistics output. When plenty of memory was added, the program worked fine. It should be noted that 64MB is sufficient when autotasking is turned off.
How much is the "plenty" of memory that we need to use for autotasking? Well, the machine tells us: it used exactly 80MB. So we adjust our script to request 80MB instead of 64MB and the program... crashes! It also crashes with 81MB. 82MB works fine though. So "plenty" seems to mean "what the program reports plus a little more" and it is going to be more than what was required without autotasking.
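One way to apply "what the program reports plus a little more" is to pull the "Memory Size (MB)" value out of a successful run's F_PROGINF output and add some headroom. This is a minimal sketch in POSIX sh; the function name "suggest_mem" and the default 2MB margin are assumptions for illustration. The margin is simply what this one program happened to need, not a documented rule.

```shell
#!/bin/sh
# suggest_mem (hypothetical helper): read F_PROGINF output on stdin and
# print the reported memory size plus a margin, rounded up to whole MB.
#   $1 = headroom in MB (assumed default: 2, from the 80MB -> 82MB case)
suggest_mem() {
    margin="${1:-2}"
    awk -F: -v m="$margin" '
        /Memory Size \(MB\)/ {
            v = $2 + m            # reported size plus headroom
            c = int(v)
            if (c < v) c++        # round up to a whole MB
            printf "%dMB\n", c
            found = 1
        }
        END { if (!found) exit 1 }   # no Memory Size line in the input
    '
}
```

Usage, assuming the job output was saved in a file called "job.out" (a made-up name): "suggest_mem 2 < job.out". The result is a candidate value for the "#@$-lM" line; it may still need adjusting by trial, as the 80MB and 81MB failures show.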
Watch your OpenMP Environment
On the Cray SV1ex the default OpenMP schedule type for parallel DO loops is DYNAMIC. From CrayDoc, the chunk size is set as follows:
  DYNAMIC is the default schedule type, depending on the type of loop,
  as follows:

  * For non-innermost loops, DYNAMIC is the default SCHEDULE type. The
    chunk size is set to 1.

  * For innermost loops without a REDUCTION clause, DYNAMIC is the
    default SCHEDULE type. The chunk size is set to the target machine's
    maximum vector length.

  * For innermost loops with a REDUCTION clause, GUIDED is the default
    SCHEDULE type. This scheduling mechanism is described in the
    following paragraphs.
If your loops, inner or outer, are large, you may get better performance by overriding the default chunk sizes with larger values.
Here's a dramatic example, using the SMP version of the STREAM benchmark (see: http://www.streambench.org ). STREAM is designed to be memory-bandwidth limited and vector heavy, so it is not necessarily indicative of how your real application will respond, and it is especially susceptible to having the flow of work chopped up into small pieces. The runs were made on a busy system.
- The scheduling type and chunk sizes are as noted.
- The "Triad" value is the memory bandwidth reported by STREAM's "Triad" test.
- The STREAM array size used was 2000000 array elements.
- These were run with 4 threads.
  OMP_SCHEDULE     Triad (MB/sec)    CPU seconds
  ------------     --------------    -----------
  <<default>>                57.0      140.87375
  DYNAMIC,128              5779.9        1.28558
  DYNAMIC,1024             8624.3        1.01457
  STATIC,128              10399.1        0.87952
  STATIC,1024              9634.7        0.92272
It's pretty clear that for this code, the default scheduling type isn't the best. You might play with some of these settings as you optimize your own codes.
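Trying several settings is easy to script. The sketch below, in POSIX sh, runs a command once per OMP_SCHEDULE value; the function name "try_schedules" and the binary name "./stream" are assumptions for illustration. Note that in standard OpenMP, OMP_SCHEDULE applies to loops whose schedule type is RUNTIME; whether it also affects default-scheduled loops on your compiler is something to confirm in its documentation.

```shell
#!/bin/sh
# try_schedules (hypothetical helper): run the given command once for each
# OMP_SCHEDULE setting, printing a label before each run.
try_schedules() {
    for sched in "DYNAMIC,128" "DYNAMIC,1024" "STATIC,128" "STATIC,1024"; do
        echo "OMP_SCHEDULE: $sched"
        # Set the variable for this one invocation only.
        OMP_SCHEDULE="$sched" "$@"
    done
}

# Example (the binary name is made up):
#   try_schedules ./stream
```

Because the variable is set only for each invocation, the calling shell's own environment is left unchanged, and a run without OMP_SCHEDULE set can still be made separately to measure the default.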
>
> WOMPAT 2003: Workshop on OpenMP Applications and Tools
> June 26 - 27, 2003 in Toronto, Ontario Canada
> http://www.eecg.toronto.edu/wompat2003
>
> IMPORTANT DATES:
>
> <2003-May-26> Early registration deadline.
> <2003-Jun-26> Opening of the WOMPAT 2003 Workshop.
>
> Registration information and the preliminary program are now available:
>
> http://www.eecg.toronto.edu/wompat2003
>
>
> ABOUT THE WORKSHOP:
>
> The OpenMP API is a widely accepted standard for high-level
> shared-memory parallel programming. Since its introduction in 1997,
> OpenMP has gained support from the majority of high-performance compiler
> and hardware vendors.
>
> WOMPAT 2003 is the latest in a series of OpenMP-related workshops, which
> have included the annual offerings of WOMPAT, EWOMP and WOMPEI.
>
>
> LOCATION:
>
> WOMPAT 2003 will be held at the Hilton Toronto in Toronto, Ontario,
> Canada. There are a number of events and festivals to be held during
> June 2003; you can find more information about events occurring around
> the time of WOMPAT at the http://www.torontotourism.com web site.
>
> AN UPDATED NOTE ON SARS:
>
> On 20-May-2003, the Center for Disease Control in the United States
> removed its travel alert for Toronto, Canada. This was done because
> more than 30 days (or 3 times the SARS incubation period) had elapsed
> since the onset of the last case. You can find the complete
> announcement at the link below:
>
> http://www.cdc.gov/travel/other/sars_can.htm
>
> On 14-May-2003, the World Health Organization removed Toronto from the
> list of areas with recent local transmission of SARS. This step was
> taken after 20 days (twice the incubation period) passed since the last
> locally acquired case of SARS had been isolated. According to the WHO,
> "the chain of transmission is considered broken." For the complete text
> of the WHO Update refer to the website below:
>
> http://www.who.int/csr/sars/archive/2003_05_14/en/
>
Quick-Tip Q & A
A:[[ I run a series of batch jobs (NQS, LoadLeveler, PBS, whatever). Each
  [[ run must create its own directory for output. My current method is to
  [[ manually edit the batch script for each run, typing the name for the
  [[ output directory, like this:
  [[
  [[   OUTDIR="results.028"
  [[
  [[ this variable is used later in the script, e.g.,:
  [[
  [[   mkdir $OUTDIR
  [[   cd $OUTDIR
  [[
  [[ I don't care much what names are used for the directories. Can you
  [[ recommend a way, if there is one, to come up with these names
  [[ automatically?

  #
  # Thanks to Richard Griswold:
  #

  One method is to append the PID to the directory name:

    OUTDIR="results.$$"
    mkdir $OUTDIR
    cd $OUTDIR

  A safer way is to use the mktemp command. If your system doesn't have
  mktemp, you can get it from http://www.mktemp.org/

    OUTDIR=`mktemp -dq results.XXXXXX` || exit 1
    cd $OUTDIR

  #
  # Thanks to Brad Chamberlain:
  #

  I use the following technique in NQS. Suggestions for generalizing it
  for other queuing systems are mentioned at the end.

  Each NQS job has a unique identifier associated with it stored in an
  environment variable called QSUB_REQID. This number corresponds to the
  number you'll see when submitting jobs or checking on their status. As
  an example, if I submit a job as follows:

    yukon% qsub mg.8W
    nqs-181 qsub: INFO
      Request <20858.yukon>: Submitted to queue <mpp>

  ...for this submission, QSUB_REQID is 20858.

  I use this variable to make qsub output filenames unique using the
  following lines in my qsub script.
    ### towards the top with other QSUB options, I insert:

    #QSUB -o output/mg.8.out
    # (this specifies that output should go in my output subdirectory
    #  and should be named mg.8.out)

    ### at the top of my actual set of commands, I insert:

    cd ~/qsub/output
    # (cd to the same output subdirectory named above)

    mv mg.8.out mg.8.$QSUB_REQID.out
    # (rename the previous output file created by this script to a
    #  new unique name, created using this job's QSUB_REQID)

  This technique works because the output file generated by the QSUB -o
  directive doesn't appear in this directory until the script completes
  running. Thus, the mv command executes before the new mg.8.out file is
  ever created.

  Note that this technique will not store the output of job 20858 in
  file mg.8.20858.out as one might like. Rather, it will store the
  output of job 20858 in mg.8.out (for now), and the output of the
  previous job in mg.8.20858.out. I find I don't care much about the
  actual job number... keeping my files unique is sufficient, so this
  trick works. I then get summary information across a number of runs
  using commands like:

    grep Time mg.8.*out

  While it's tempting to put the $QSUB_REQID directly in the -o
  directive, it seems that variable names are not expanded there, so you
  will literally get a file called mg.8.$QSUB_REQID.out, which isn't
  terribly useful.

  Other queueing systems typically have similar built-in variables that
  are unique to each submission, but I don't know them offhand, so
  you'll need to read some man pages to find out how to do that.

  Another approach would be to use the built-in $$ variable provided by
  csh-like scripts to refer to a script's process number. This could be
  used instead of $QSUB_REQID above, for example (I prefer QSUB_REQID
  because it corresponds to a number that I have a better grasp of, even
  though it has the imperfect "off-by-a-submission" issue mentioned
  above).
  #
  # Editor's method:
  #

  This NQS script creates a directory with the name
  "outdir.YYYYMMDD.HHMM" where YYYY is the year, MM is the month,
  etc...

    #QSUB -q batch
    #QSUB -lM 100MW
    #QSUB -lT 8:00:00
    #QSUB -s /bin/ksh

    cd $QSUB_WORKDIR

    OUTDIR=outdir.$(date "+%Y%m%d.%H%M")
    mkdir $OUTDIR
    cd $OUTDIR

  And the result of a test:

    CHILKOOT$ qsub t.qsub
    nqs-181 qsub: INFO
      Request <11293.chilkoot>: Submitted to queue <batch>
    CHILKOOT$ ls -l -d outdir.*
    drwx------ 2 staff 4096 May 23 15:24 outdir.20030523.1524

Q: OUCH!!!!!!!!!

   I had, yes, note past tense... a couple files to save, several to
   delete, and some of those to delete had permission 400. They looked
   something like this:

     $ ll
     total 144
     -rw------- 1 saduser sadgroup  4280 May 23 15:36 d
     -rw------- 1 saduser sadgroup   535 May 23 15:36 e
     -rw------- 1 saduser sadgroup 17120 May 23 15:35 f
     -r-------- 1 saduser sadgroup  8560 May 23 15:35 a
     -r-------- 1 saduser sadgroup  2140 May 23 15:35 c
     -r-------- 1 saduser sadgroup  1070 May 23 15:35 b

   To simplify my life, I did this:

     rm -i -f ?

   I expected "rm" to ask about each file before deleting it, and to
   take care of the "400" files automatically. Oh well.. it blasted them
   all, and didn't even ask.

   If there's a question in all this, maybe you could answer it. I'm too
   upset to think.
[[ Answers, Questions, and Tips Graciously Accepted ]]
Ed Kornkven
ARSC HPC Specialist
ph: 907-450-8669

Kate Hedstrom
ARSC Oceanographic Specialist
ph: 907-450-8678

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Subscribe to (or unsubscribe from) the e-mail edition of the
ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.