ARSC HPC Users' Newsletter 208, November 17, 2000
Job Openings at ARSC
Individuals are being sought to fill new positions at ARSC:
- Parallel Software Specialist
- HPC Vector Specialist
- User Services Consultant
- Senior High Performance Computing System Programmer/Analyst IV
For details, see:
Editor's Comment: If you're interested in these positions, don't be scared off by the winters! This is a land of stunning beauty, especially in the winter, and, like several ARSCers, you might even come to enjoy skiing on your lunch hour.
SV1 Cache and Vector Relationships
To better understand the SV1 processors, we've been testing several different user codes. One shows an interesting effect of the vector cache.
This is a finite difference code. It uses Fortran 90 features extensively and is dominated by 4-point stencil operations as shown here:
      ! Get maximum of each point and its neighbors
      field_max (:,:) = MAX ( field (:,:)                        &
            , CSHIFT (field (:,:), SHIFT=-1, DIM=1)              &
            , CSHIFT (field (:,:), SHIFT=+1, DIM=1)              &
            , CSHIFT (field (:,:), SHIFT=-1, DIM=2)              &
            , CSHIFT (field (:,:), SHIFT=+1, DIM=2) )
In this example, the entire array is shifted four times, N, S, E, and W, and then some function, "MAX" in this case, is applied to the original and the four shifted arrays. This approach leads to uncluttered, easy-to-read code and, boundary handling aside, the effect is the same as the approach using nested DO loops:
      do i = ...
         do j = ...
            field_max (i,j) = MAX ( field (i  , j  )             &
                  , field (i+1, j  )                             &
                  , field (i-1, j  )                             &
                  , field (i  , j+1)                             &
                  , field (i  , j-1) )
         enddo
      enddo
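The equivalence of the two forms can be sketched in NumPy (an illustrative model, not the original Fortran; np.roll plays the role of a circular shift, and the loop version uses modular indexing so that the boundary treatment matches):

```python
import numpy as np

def field_max_shifted(field):
    """Shift-based form: max of each point and its four neighbors."""
    return np.maximum.reduce([
        field,
        np.roll(field, -1, axis=0),  # neighbor at i+1
        np.roll(field, +1, axis=0),  # neighbor at i-1
        np.roll(field, -1, axis=1),  # neighbor at j+1
        np.roll(field, +1, axis=1),  # neighbor at j-1
    ])

def field_max_loops(field):
    """Loop-based form, with modular indexing to mimic circular shifts."""
    n, m = field.shape
    out = np.empty_like(field)
    for i in range(n):
        for j in range(m):
            out[i, j] = max(field[i, j],
                            field[(i + 1) % n, j],
                            field[(i - 1) % n, j],
                            field[i, (j + 1) % m],
                            field[i, (j - 1) % m])
    return out

rng = np.random.default_rng(0)
field = rng.random((6, 6))
assert np.array_equal(field_max_shifted(field), field_max_loops(field))
```

The shift-based form trades a few whole-array temporaries for readability, which is exactly the cache effect explored below.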
"Flowtrace" was used on runs of the program to determine the subroutines which accounted for the most time. Then, to facilitate testing, a driver program was written to invoke the three subroutines identified by flowtrace as dominant.
One of the tests was to do multiple runs of the driver with different problem sizes. (This amounted, basically, to changing the size of the array "field", above.) The total amount of work was kept approximately constant by increasing the number of iterations as the problem size decreased. Tests were performed on 1-8 CPUs and on one MSP. The Hardware Performance Monitor (hpm), which was described in the last newsletter, and the Job Accounting tool (ja) were used to measure the following fields:
- Elapsed time
- CPU Memory references per second (per processor)
- Cache Hits per second (per processor)
- MFLOPS (per processor)
Here are a few observations concerning the results:
- From the "Cache Hits/Sec" graph, cache use peaks, drops off, and eventually disappears as the problem size increases. At a problem size of about 60x60, all the arrays presumably fit within cache. As the problem grows further, the multiple array shift operations overwrite more and more of the data in cache until none of the cached data is useful.
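The 60x60 turnover is consistent with simple working-set arithmetic. Here is a back-of-envelope sketch; the 256 KB per-CPU cache size and the count of five live arrays (the original, the result, and the shifted temporaries) are our assumptions, not measured figures:

```python
# Illustrative working-set estimate for the stencil operation,
# assuming a 256 KB per-CPU data cache and 8-byte words.
CACHE_BYTES = 256 * 1024

def working_set_bytes(n, live_arrays=5):
    # field, field_max, and (roughly) the shifted temporaries
    return live_arrays * n * n * 8

assert working_set_bytes(60) < CACHE_BYTES    # 60x60: ~144 KB fits
assert working_set_bytes(100) > CACHE_BYTES   # 100x100: ~400 KB does not
```

Under these assumptions, everything fits in cache near 60x60 and spills soon after, which matches the observed peak and drop-off.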
- MFLOPS and CPU Mem Refs/Sec are directly related, and the shapes of the two graphs are nearly identical. Given the volume of data movement, this code is naturally constrained by memory bandwidth. And, as oft repeated at the SV1 CUG last month, memory bandwidth drives performance....
- CPU Mem Refs/Sec are strongly affected by cache use. The traces appear to be underlain by a monotonically increasing curve with the Cache Hits/Sec shape added on top. The best CPU MFLOPS observed occurs in the 1-CPU case, where cache use is best.
- After cache becomes irrelevant, CPU Mem Refs/Sec improves with increasing problem size. This, we assume, is the expected benefit of increased vector length on a vector processor. The longer the vector the better, as startup costs become a smaller percentage of the total vector processing. Another oft repeated comment at the CUG: "the SV1 is STILL a vector machine!"
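The vector-length benefit can be captured in a toy model: if a vector operation of length n costs (startup + n) cycles, efficiency is n / (startup + n), approaching 1 as vectors grow. The startup value below is illustrative, not a measured SV1 figure:

```python
# Toy model of vector efficiency vs. vector length.  A vector
# operation of length n is assumed to cost (startup + n) cycles,
# so efficiency = n / (startup + n).  startup=50 is illustrative.
def vector_efficiency(n, startup=50):
    return n / (startup + n)

assert vector_efficiency(64) < vector_efficiency(1000)  # longer is better
assert vector_efficiency(10000) > 0.99                  # startup amortized
```

This is why, after cache ceases to matter, larger problems (longer vectors) keep improving Mem Refs/Sec.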
Cache use, CPU Mem Refs/Sec, and per-processor MFLOPS all diminish with increasing numbers of CPUs, although this effect is reduced (asymptotically?) as the problem size grows.
Presumably, this results from the autotasking mechanism. Particular array chunks may not be getting scheduled to the same processors on successive iterations. A cache miss occurs when the data needed by CPU-1 is cached on CPU-2.
Performance on one CPU of a multi-streaming processor (MSP) beats that of one CPU in a group of 4 autotasked CPUs. In fact, it roughly matches that of one CPU in a group of 2 autotasked CPUs.
To understand this, here are some relevant features of the SV1:
- There are 4 CPUs per module
- The maximum memory bandwidth to a module is 5 GB/sec
- The maximum memory bandwidth to a CPU is 2.5 GB/sec
- An MSP is a new way to combine the resources of 4 CPUs; in an MSP, each CPU is taken from a different module
Thus, if 4 CPUs are autotasked, happen to be drawn from the same module, and are working on a memory-intensive code like this, then they will compete for the module bandwidth of 5 GB/sec and, in the worst case, obtain a quarter of that, or 1.25 GB/sec each.

If 2 CPUs are autotasked, they can expect 2.5 GB/sec each.

On the other hand, if 4 CPUs are combined into an MSP, the bandwidth available to each is not limited by the module bandwidth, and can be as high as 2.5 GB/sec each.
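This arithmetic can be sketched directly, using only the bandwidth figures given above (worst-case sharing of the module port, capped by the per-CPU port):

```python
# Per-CPU memory bandwidth under the scenarios described above (GB/sec).
MODULE_BW = 5.0   # maximum bandwidth to a module
CPU_BW = 2.5      # maximum bandwidth to a single CPU

def per_cpu_bw(cpus_on_module):
    """Worst case: CPUs on one module split the module bandwidth,
    but no CPU can exceed its own 2.5 GB/sec port."""
    return min(CPU_BW, MODULE_BW / cpus_on_module)

assert per_cpu_bw(4) == 1.25   # 4 autotasked CPUs on one module
assert per_cpu_bw(2) == 2.5    # 2 autotasked CPUs on one module
assert per_cpu_bw(1) == 2.5    # MSP: each CPU on its own module
```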
What the graphs in this test may show is exactly this effect of the CPU vs module bandwidth of the SV1. (Note that the tests were run on a dedicated system, so there were no other jobs competing for bandwidth or CPUs.)
The complexity of these curves suggests that there may be no universal formula for optimization. However, the most common recommendation from the SV1 CUG was: first improve vectorization and only then worry about cache.
We encourage you to share your experiences using the SV1.
Update on T3E Galaxy Simulations
Fabio Governato and L. Mayer (Milan, Italy, and University of Washington), together with their colleagues in Milan (Italy), Durham (UK), and at the "N-Body shop" (UW), have performed on yukon, the ARSC T3E, the highest resolution simulation to date of the dynamical evolution of a dwarf galaxy.
This simulation took more than 10,000 node-hours to evolve 3 million particles for 7 Gyr, as the dwarf galaxy orbited around a larger galaxy (similar to our Milky Way). The mass resolution is just 50 solar masses per particle!
This simulation, together with others performed at ARSC and at the Italian supercomputing center CINECA:
are part of a project that demonstrated the dynamical evolution undergone by disk satellites infalling onto the two giant spirals of the Local Group.
Small satellites with thin rotating stellar disks get transformed into dwarf spheroidals (where stars move on isotropic orbits) by the strong gravitational field. Large stellar streams, perhaps observable by future satellite missions, are created in the process.
A Web page has been set up with more information on the Local Group and a collection of mpeg movies and gif images of interacting galaxies in the Local Group environment:
Results will be published in Astrophysical Journal Letters.
Quick-Tip Q & A
A:[[ Speaking of inlining, it won't work if the subroutine to be
  [[ inlined contains Fortran 90 "OPTIONAL" arguments. Unfortunately,
  [[ my most-frequently called subroutine does indeed have an OPTIONAL
  [[ argument, named "MASK":
  [[
  [[   SUBROUTINE SUB_ORIG (FIELD, MASK)
  [[
  [[ I've thus replaced SUB_ORIG with two subroutines: one which requires,
  [[ and one which lacks, the OPTIONAL argument. The code changes were
  [[ trivial, and the new declarations look like this:
  [[
  [[   SUBROUTINE SUB_nom (FIELD)        ! no mask
  [[   SUBROUTINE SUB_msk (FIELD, MASK)  ! mask is required
  [[
  [[ Now the hard part.
  [[
  [[ The original subroutine was called ~360 times, in some 100 source
  [[ files (of 300 total) which reside in 3 different source directories,
  [[ and it's called in two different ways, depending on the need for the
  [[ optional argument. The original calls look, for example, like this:
  [[
  [[   CALL SUB_ORIG (G_BASAL_SALN (:,:))
  [[   CALL SUB_ORIG (O_DP_U (:,:,K,LMN), MASK = O_MASK_U (:,:))
  [[
  [[ I need to update all these calls with the appropriate replacements:
  [[
  [[   CALL SUB_nom (G_BASAL_SALN (:,:))
  [[   CALL SUB_msk (O_DP_U (:,:,K,LMN), O_MASK_U (:,:))
  [[
  [[ How would you manage this?

Here's a perl script that works in most cases (it doesn't handle, for instance, CALL statements split over more than one line, and it is case sensitive). Call it "replace.prl":

  #!/usr/local/bin/perl -pi.bak
  m/CALL *SUB_ORIG/ &&   m/MASK *=/ && s/SUB_ORIG/SUB_msk/ && s/MASK *=// ;
  m/CALL *SUB_ORIG/ && ! m/MASK *=/ && s/SUB_ORIG/SUB_nom/ ;

You could invoke it on the files in the three directories as follows (with obvious assumptions about the file and directory names):

  ./replace.prl source1/*.f source2/*.f source3/*.f

That's it. Done. This is so powerful, it's scary.

The magic is in the "-p" and "-i" switches in the interpreter command, "#!/usr/local/bin/perl -pi.bak":

  -p : assumes a loop around the script. The loop iterates over
       every line of the input files.
  -i : edits the input files "in-place".
The ".bak" is an option to "-i" which causes a backup of each original file to be saved to, in this case, files with the extension ".bak".

Q: According to hpm, my SV1 code gets 30 million cache hits/second. How can I tell if this is really improving its performance?
[[ Answers, Questions, and Tips Graciously Accepted ]]
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020, Fairbanks AK 99775-6020
Subscribe to (or unsubscribe from) the e-mail edition of the
ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.