ARSC HPC Users' Newsletter 275, August 22, 2003
IBM xprofiler and Optimization
[ Thanks to Kate Hedstrom of ARSC. ]
During a code tuning exercise with IBM ACTC visitors, I tried xprofiler, a GUI version of gprof. To use xprofiler, compile your code with the "-g -pg -qfullpath" options. The -g flag does not conflict with optimization on the IBM. Run your code as normal and you'll have a gmon.out file. I was running a serial code on ferry, the interactive part of iceflyer, and my initial xlf flags were:
  -O3 -qstrict -qarch=pwr4 -qtune=pwr4 -qdpc \
  -qflttrap=enable:invalid:imprecise -u -qzerosize
These flags give the Power4 instructions, a core dump on NaN, and constants promoted to double precision.

Start up xprofiler and ask it to load your executable and the gmon.out file [File->Load Files]. You should see a lot of little graphs. Go to [Filter->Uncluster Functions] and [Filter->Hide All Library Calls]. Now you should have a call graph of your program, with the height and width of the boxes representing the time spent in the functions. The width is the time in the routine plus all children; the height is the time in that routine alone. Go to [View->Zoom In] to see the names of the boxes. A right mouse click on a box will bring up a menu, giving you the option of looking at the source code. There you will see the relative counts for each line of code (the higher the number, the more time spent there). You can also do [Report->Flat Profile] to get a gprof-like listing.
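Put together, the whole profiling workflow looks something like this (a sketch only; the file and executable names are hypothetical, and the flags are those given above):

```
xlf -g -pg -qfullpath -O3 -qstrict -qarch=pwr4 -qtune=pwr4 \
    -o model model.f
./model                    # a normal run; writes gmon.out
xprofiler model gmon.out   # or start xprofiler bare and use
                           # [File->Load Files]
```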
I started with a 2-D setup of FVCOM (Finite Volume Coastal Ocean Model) taking 240 seconds to run. Using xprofiler, we noticed that pow, sin, and cos were taking a lot of time. The first optimization was to use the MASS library, which provides a faster, less accurate version of many intrinsic functions, like these. Simply add -lmass to the load step when compiling. That alone sped up the code to 130 seconds. There is also a vector version of MASS (-lmassv) which requires some rewriting of the function calls in the code for an even greater speedup.
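As a sketch, assuming object files and an executable named as below, the load step becomes:

```
xlf -O3 -qstrict -qarch=pwr4 -qtune=pwr4 -o fvcom *.o -lmass
```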
Before taking that step, we looked at the calls to determine in more detail what was happening.
The pow calls were being generated by (term)**0.5, which can be rewritten as sqrt(term). That change alone got us down to 100 seconds, since sqrt is a hardware instruction on the Power4 processor. Looking at the calls to sin and cos, we found they were being performed every timestep on angles which remain constant. Pre-computing the sin/cos values dropped the execution time to 80 seconds.
When I described this optimization to the FVCOM development team, they replied that the sin/cos computations were not necessary at all, and a simpler form could be used. Cleaning that up and converting term/dz to term*(stored 1/dz) got us down to 75 seconds.
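A minimal Fortran sketch of the three rewrites described above (the variable names are hypothetical, not FVCOM's):

```
! Setup, done once before the time loop: the angles are constant,
! so precompute the trig values and the reciprocal of dz.
      do i = 1, n
         cosang(i) = cos(angle(i))
      end do
      rdz = 1.0 / dz

! Inside the time loop:
!   before:  flux(i) = (term(i))**0.5 * cos(angle(i)) * q(i)/dz
!   after:   sqrt maps to a Power4 hardware instruction, and the
!            divide becomes a multiply by the stored reciprocal.
      do i = 1, n
         flux(i) = sqrt(term(i)) * cosang(i) * q(i)*rdz
      end do
```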
The code is now three times faster than the original, and the next step will be to profile it with a 3-D problem.
The results look equivalent to the original results using the "eyeball norm". A diff of old and new shows some differences at about 1.0e-6, which could be explained by getting rid of the extra sin/cos nonsense.
All in all, this effort was well worth the time. It took a day to learn xprofiler and make these changes. Note that I didn't have to try the vector MASS library at all, since the transcendental library calls were removed.
However, if you have a significant number of these calls, it could be worth looking into - a good example of when to use conditional compilation:
#ifdef MASSV
      call vsin(b, a, 100)
#else
      do i=1,100
         b(i) = sin(a(i))
      end do
#endif
# Editor's note:
#
# xlf will automatically convert loops like that given above to vector
# intrinsic function calls, if you give it the -qhot option. When using
# aggressive optimization on any compiler, be sure to validate results.
#
# For a similar case study, see:
#   "Optimizing with IBM Vector Intrinsics and xlf -qhot"
# in issue #250.
Redbook on IBM Performance and Optimization Tools
[ From Jim Long of ARSC: ]
The "AIX 5L Performance Tools Handbook" just came out last week. See:
This IBM Redbook takes an insightful look at the performance monitoring and tuning tools that are provided with AIX 5L. It discusses the use of the tools as well as the interpretation of the results in many examples.
This book is meant as a reference for system administrators and AIX technical support professionals so they can use the performance tools efficiently and interpret the outputs when analyzing AIX system performance.
A general concept and introduction to the tools is presented to introduce the reader to the process of AIX performance analysis.
The individual performance tools discussed in this book fall into these categories:
- Multi-resource monitoring and tuning tools
- CPU-related performance tools
- Memory-related performance tools
- Disk I/O-related performance tools
- Network-related performance tools
- Performance tracing tools
- Additional performance topics, including the performance monitoring API, Workload Manager tools, and the Performance Toolbox for AIX
Conditional Compilation: Part II
[ Second in a 2 part series, contributed by Kate Hedstrom of ARSC. ]
Last time, we covered conditional compilation with the C preprocessor, cpp. This time, we're going to cover coco, the new conditional compilation facility defined in an optional part of the Fortran standard. The draft description of coco in the standard is on the web at:
A coco command has a '??' for the first two characters on the line. The rest of the syntax is meant to be Fortran-like:
?? LOGICAL, PARAMETER :: XX = .TRUE.
?? IF (XX) THEN
      call xx
?? END IF

or:
?? integer, parameter :: XX = 1
?? if (XX == 1) then
      call xx
?? end if
The coco commands can be written in either upper or lower case, and the space in "END IF" and "ELSE IF" is optional. Coco variables are of type integer or logical, and may be declared as constant parameters or not.
In cpp, the lines that aren't used are turned into blanks or are deleted. In coco, you get to choose one of five options: delete, blank, and three styles of Fortran comments (starting with !). For instance:
?? integer, parameter :: CRAY = 1, IBM = 2, SGI = 3
?? integer :: system = IBM
?? if (system == IBM) then
      use ibm_mod
?? else if (system == CRAY) then
      use cray_mod
?? else
      use sgi_mod
?? end if

will produce by default:
!?>?? integer, parameter :: CRAY = 1, IBM = 2, SGI = 3
!?>?? integer :: system = IBM
!?>?? if (system == IBM) then
      use ibm_mod
!?>?? else if (system == CRAY) then
!?>      use cray_mod
!?>?? else
!?>      use sgi_mod
!?>?? end if
A coco program will recognize a set file, a separate file which can be used to set the values of coco variables. For the above case, we can have a set file containing:
?? alter: delete
?? integer, parameter :: SGI = 3
?? integer :: system = SGI

producing this output:

      use sgi_mod
As you can see, the set file overrides any values set inside the coco program. Each coco program can have at most one set file and the set file can be shared by all the routines that make up a program.
The goal is that eventually, coco will be a part of the Fortran 2000 compiler system and you won't have to do anything. Right now, the major free implementation is by Purple Sage:
There is a claim that another is at:
but this site pops up a bunch of ads, then causes a core dump of my old netscape. Let's concentrate on the Purple Sage version. If you type:
% coco model
it will look for model.fpp as the input, produce model.f90 as the output, and look for model.set as the set file. If model.set doesn't exist, it will look for coco.set. Obviously, we need to be invoking coco in our Makefile for now:
.SUFFIXES: .o .fpp .f90

.fpp.o:
	$(COCO) $(COCOFLAGS) $<
	$(F90) -c $(FFLAGS) $*.f90

.f90.o:
	$(F90) -c $(FFLAGS) $<

.fpp.f90:
	$(COCO) $(COCOFLAGS) $<
Building the Purple Sage coco is a multi-step process and they provide some example input files for PC compilers. To be perfectly honest, I haven't had any luck yet building it on our Unix platforms. Still, it is wonderful that they are willing to provide the source code, which means that it can and will be fixed. In the long run, coco will make the Fortran purists feel good about conditional compilation. Meanwhile, the rest of us will continue to get by with cpp and similar tools.
Quick-Tip Q & A
A:[[ I changed the optimization level for one compiler optimization option
 [[ in my makefile, remade everything, and now my program is getting
 [[ different results. There are over 75 source files.
 [[
 [[ Any suggestions how I might find where this compiler option is causing
 [[ a difference?

#
# Thanks to Brad Chamberlain:
#

Well, the good news is that most optimizing compilers treat source files
independently, so you can probably factor out interplay between the 75
source files, which reduces the combinatorics somewhat. A brute force way
would be to compile 75 times, turning optimizations on for only one file
at a time to determine where the problem is. Or you could use a binary
search (compile half with optimizations, half without; depending on the
result, try the opposite half) -- this assumes there's only one problem.

Once you have the file in question, I tend to use printfs to determine
where answers differ, painful as they are. Doing relative debugging
between the two programs (optimized and unoptimized) would be the ideal
way to approach this problem, but I don't think any of the relative
debuggers have made their way far enough out of research-land to make
their use worthwhile.

Another, more indirect, approach would be to see if turning on additional
warnings in the compiler, bounds checking, pointer checking, efence, or
whatever features are available to you will reveal any problems in your
code that are changing its meaning with optimizations (incorrect code is
the most likely cause of optimizations changing behavior).

#
# From Guy Robinson:
#

Sometimes I've just compared the sizes of the object and other files
output by the compiler. If it is only a small difference you are looking
for, this works well. A typical case is trying to see if inlining has
been done. Also, the IBM and Cray compilers can both be asked to output
intermediate, semi-readable listings. These, and other listings like
loopmarks, can be diff'ed from one compile to the next.
Q: I like "mget" and "mput" in ftp, but I'm sick of answering "y", "y",
   "y", "y", "y", "y", "y", "y", "y", "y", "y"... when I know I want ALL
   the files!

   You may have experienced it. It goes like this:

   ftp> mget *.f
   mget adpott.f? y
   227 Entering Passive Mode (199,165,85,37,4,128)
   150 Opening BINARY mode data connection for adpott.f (500 bytes).
   226 Transfer complete.
   500 bytes received in 0.0022 seconds (2.2e+02 Kbytes/s)
   mget at.f? y
   227 Entering Passive Mode (199,165,85,37,4,129)
   150 Opening BINARY mode data connection for at.f (15682 bytes).
   226 Transfer complete.
   15682 bytes received in 0.009 seconds (1.7e+03 Kbytes/s)
   mget badolb.f? y
   227 Entering Passive Mode (199,165,85,37,4,130)
   150 Opening BINARY mode data connection for badolb.f (4543 bytes).
   226 Transfer complete.
   4543 bytes received in 0.012 seconds (3.6e+02 Kbytes/s)
   mget bccc.f?

   etc.... etc.... etc.... etc....

   So I often log onto the remote system (when the files are in my own
   account, of course), make a tar file, and just "get" the tar file.
   Is there another way?
[[ Answers, Questions, and Tips Graciously Accepted ]]
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Subscribe to (or unsubscribe from) the e-mail edition of the
ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.