ARSC HPC Users' Newsletter 387, May 30, 2008
Measuring Program Performance using PAPI and Tau
[ By: Oralee Nudson and Ed Kornkven ]
When profiling and optimizing code, it is valuable to know just how efficiently your program runs at the hardware level. One tool available on midnight for measuring hardware statistics is the Performance Application Programming Interface (PAPI), which TAU provides an interface to. The TAU interface to PAPI gives you access to statistics such as the number of data cache misses, Vector/SIMD instructions executed, and other hardware counters. The following steps describe how to run an instrumented Fortran MPI program to capture hardware statistics using TAU and PAPI. An example program can also be found on midnight in $SAMPLES_HOME/parallelEnvironment/tau_papi_counters.
1) Set the TAU+PAPI environment variables. If you plan on using TAU+PAPI frequently, you may want to add these environment variables to your ~/.profile. Otherwise, be sure to set these in each open shell you'll be using.
Bash and ksh users:
% export TAU_MAKEFILE=$PET_HOME/tau/x86_64/lib/Makefile.tau-multiplecounters-pathcc-mpi-papi-pdt
% export PATH=$PATH:$PET_HOME/tau/x86_64/bin

- or -
Csh and tcsh users:
% setenv TAU_MAKEFILE $PET_HOME/tau/x86_64/lib/Makefile.tau-multiplecounters-pathcc-mpi-papi-pdt
% setenv PATH $PATH:$PET_HOME/tau/x86_64/bin
2) Compile and link your program with the appropriate TAU+PAPI scripts. Compile the Fortran code using the TAU "tau_f90.sh" script; the script location was added to your $PATH in step 1. You'll also want to include the command line option that embeds the location of the PAPI shared library in the executable: -Wl,-R/u2/wes/PET_HOME/pkgs/papi-3.5.0/lib64. Put all together, your compilation command should look something like this:
tau_f90.sh myfile.f90 -o myfile -Wl,-R/u2/wes/PET_HOME/pkgs/papi-3.5.0/lib64
3) Set the PAPI counter environment variables and execute the program. There are four hardware counters available on the AMD Opteron processors on midnight. To obtain a list of all hardware counters available on the machine, execute the "papi_avail" binary on a compute node. This can be done by first starting an interactive job then running the "papi_avail" executable:
% qsub -I
% /u2/wes/PET_HOME/pkgs/papi-3.5.0/bin/papi_avail
% exit
Once you decide on a list of events to be measured (remember, four hardware counters is the maximum available on the midnight processors), you'll need to set the COUNTER[1-n] environment variables at runtime. In this example we will measure "total L1 cache accesses" and "L1 data cache misses". These variables can be set in either of the following two ways:
3.1) Inside your PBS job submission script, set the environment variables on the mpirun execution command line itself:
mpirun -np 4 COUNTER1=GET_TIME_OF_DAY COUNTER2=PAPI_L1_TCA COUNTER3=PAPI_L1_DCM ./myfile

- or -
3.2) Write a script containing the counter values and execute it within the PBS job submission script on the mpirun execution command line (here we create a "launch.sh" script):
% cat > launch.sh
export COUNTER1=GET_TIME_OF_DAY
export COUNTER2=PAPI_L1_TCA
export COUNTER3=PAPI_L1_DCM
./myfile
<ctrl-D>
% chmod u+x launch.sh
(Add the following to the PBS job submission script)
mpirun -np 4 ./launch.sh
Regardless of the method used, make sure COUNTER1 is always set to GET_TIME_OF_DAY. This allows TAU to synchronize timestamps across MPI tasks.
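Putting step 3 together, a complete PBS job submission script might look like the following sketch, using method 3.1 to set the counter variables. The queue name, walltime, and processor count are illustrative assumptions only; substitute the values appropriate for your own jobs on midnight.

```shell
#!/bin/sh
#PBS -q standard          # queue name: an assumption, use your usual queue
#PBS -l walltime=1:00:00  # illustrative walltime
#PBS -j oe                # join stdout and stderr

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR

# COUNTER1 must be GET_TIME_OF_DAY so TAU can synchronize timestamps
# across MPI tasks; the remaining counters are the PAPI events chosen
# from the papi_avail listing in step 3.
mpirun -np 4 COUNTER1=GET_TIME_OF_DAY COUNTER2=PAPI_L1_TCA \
    COUNTER3=PAPI_L1_DCM ./myfile
```

After the job completes, the profile data appears in the MULTI__* directories described in step 4.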
4) Examine the output for the counters. For example, for the counters above:
pprof -f MULTI__GET_TIME_OF_DAY/profile
pprof -f MULTI__PAPI_L1_TCA/profile
pprof -f MULTI__PAPI_L1_DCM/profile
Portland Group 7.1.6 Compiler Suite Available on Midnight
Version 7.1.6 of the Portland Group compiler suite is now available on midnight. The new version can be used by loading either the "PrgEnv.pgi.new" module or the "PrgEnv.pgi-7.1.6" module.
Release notes for this version of the PGI compiler suite are available on the Portland Group website: http://www.pgroup.com/doc/pgiwsrn716.pdf
Note that midnight does not use the version of MPI included with the PGI workstation release, so sections of the release notes pertaining to MPI are not relevant to midnight.
Quick-Tip Q & A
A:[[ ARSC's data archive system works best when directory trees are
  [[ stored as one tar file rather than many constituent files.
  [[ However, that makes comparing individual files in my working
  [[ directory with my archived copy a pain. Is there any way to
  [[ easily synchronize my working directory with my archive directory
  [[ when my archive is stored as a tar file?

  #
  # We didn't receive any solutions for this question. If you have any
  # ideas on how to handle this, let us know!
  #

Q: I frequently write scripts to build input files. The way I normally
   handle this is by using "echo" commands in the script to send the
   input file to stdout and then redirect stdout to the filename I
   would like to use. E.g.:

   ./build_input > namelist.input

   Rather than redirecting stdout, I would like to have the output from
   echo go to a specific file (e.g. namelist.input). Is there a way to
   redirect stdout for a script within the script itself? I really hate
   having to redirect each "echo" statement individually.
[[ Answers, Questions, and Tips Graciously Accepted ]]
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Subscribe to (or unsubscribe from) the e-mail edition of the
ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.