ARSC HPC Users' Newsletter 401, March 6, 2009

Reporting Live From The Moose Invasion

As any motorist in central Alaska can surely attest, it has never been particularly difficult to happen upon a moose in Fairbanks. But according to University of Alaska Fairbanks police, it may have recently become much easier:

"The University Police Department has had a dramatic increase in the number of nuisance moose calls on the Fairbanks campus this year and the Alaska Department of Fish and Game is reporting an increase in the Fairbanks area moose population."

Be mindful, use common sense, and do not feed the local wildlife. If you are able to tame one of these beasts, however, hitching a moose ride to work may be the greenest of all alternative energies. [Editor's Note: Hybrids are tamed by design.]

ARSC Spellers Place Third in BizBee

The 2009 ARSC "Crayons" took third place out of 26 teams in the annual BizBee, a spelling-bee fundraiser for the Literacy Council of Alaska. Three familiar faces composed this year's team: HPC Systems Analyst Dale Clark, Oceanographic Specialist Kate Hedstrom, and Chief Scientist Greg Newby.

Choosing to spell "steganopodous" over "breviloquence", the Crayons fell to this formidable foe. The winning word for the evening was "foggara", a French translation of the Arabic word "qanat", a type of water management system used for irrigation in arid climates.

Other words that made appearances throughout the night were:


didactic
hartebeest
cuchifrito
nidificate

Congratulations, Crayons, for another great performance!

Valgrind's Cachegrind Profiler Tool

[ By Craig Stephenson ]

Previously, in newsletter 398, I covered the usage of Valgrind's Memcheck tool to catch unintended memory operations and memory leaks.

/arsc/support/news/hpcnews/hpcnews398/index.xml#article3

Memcheck is only one of the tools provided by Valgrind. In this article, I will discuss the basics of Valgrind's Cachegrind profiler tool.

Simply put, Cachegrind analyzes an executable's low-level instruction and data operations. It reports the total number of instructions executed and memory reads and writes, along with the number of cache misses incurred along the way. This information is critical to the performance-minded, as reducing the number of cache misses can translate into substantial improvements in speed. Cachegrind can generate reports as a summary of the entire program, per function, or line by line.

Running Cachegrind with its default options will produce a summary of the entire program. For example:


> /u2/wes/PET_HOME/bin/valgrind --tool=cachegrind ./example
...
==2950== I   refs:      1,999,515
==2950== I1  misses:        1,110
==2950== L2i misses:        1,090
==2950== I1  miss rate:      0.05%
==2950== L2i miss rate:      0.05%
==2950== 
==2950== D   refs:        594,114  (462,989 rd   + 131,125 wr)
==2950== D1  misses:       16,918  ( 15,298 rd   +   1,620 wr)
==2950== L2d misses:        7,868  (  6,641 rd   +   1,227 wr)
==2950== D1  miss rate:       2.8% (    3.3%     +     1.2%  )
==2950== L2d miss rate:       1.3% (    1.4%     +     0.9%  )
==2950== 
==2950== L2 refs:          18,028  ( 16,408 rd   +   1,620 wr)
==2950== L2 misses:         8,958  (  7,731 rd   +   1,227 wr)
==2950== L2 miss rate:        0.3% (    0.3%     +     0.9%  )
Profiling timer expired

(Remember from the previous article in this series that each line of Valgrind output is prefixed with the process ID.)

A quick key to interpret this output is as follows:


  I/i = instructions
  D/d = data
  I1  = level 1 instruction cache
  D1  = level 1 data cache
  L2  = level 2 shared instruction/data cache
  rd  = data read
  wr  = data write

This example had a level 1 data cache (D1) miss rate of 2.8% and a level 2 data cache (L2d) miss rate of 1.3%. Not too bad.
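
(These rates follow directly from the counts above: a miss rate is simply the number of misses divided by the number of corresponding references.)

  D1  miss rate:  16,918 / 594,114 = 2.8%
  L2d miss rate:   7,868 / 594,114 = 1.3%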

A textbook example of the benefits of cache optimization comes from the distinction between row-major and column-major programming languages. To traverse the memory of a multi-dimensional array sequentially in a row-major language such as C or C++, the innermost loop should run over the last (rightmost) array index, so that each row is accessed in order. Conversely, the memory of a multi-dimensional array in a column-major language such as Fortran is laid out column by column, so the innermost loop should run over the first (leftmost) index. Let's use Cachegrind to put this to the test with the following two equivalent programs:


Row-Major / Column-Major Comparison in C
----------------------------------------
#include <stdio.h>

void rowMajor()
{
  int A[1000][100][10];
  int i, j, k;

  for(i=0; i < 1000; i++)
  {
    for(j=0; j < 100; j++)
    {
      for(k=0; k < 10; k++)
      {
        A[i][j][k] = i + j + k;
      }
    }
  }
}

void columnMajor()
{
  int A[1000][100][10];
  int i, j, k;

  for(k=0; k < 10; k++)
  {
    for(j=0; j < 100; j++)
    {
      for(i=0; i < 1000; i++)
      {
        A[i][j][k] = i + j + k;
      }
    }
  }
}

int main()
{
  rowMajor();
  columnMajor();

  return 0;
}
----------------------------------------

Row-Major / Column-Major Comparison in Fortran 90
-------------------------------------------------
PROGRAM major
IMPLICIT NONE

  CALL rowMajor()
  CALL columnMajor()

END PROGRAM major

SUBROUTINE rowMajor()
IMPLICIT NONE
  INTEGER, DIMENSION(1000,100,10) :: A
  INTEGER :: i, j, k

  DO i = 1,1000 
    DO j = 1,100
      DO k = 1,10
        A(i,j,k) = i + j + k
      END DO
    END DO
  END DO

  RETURN
END

SUBROUTINE columnMajor()
IMPLICIT NONE
  INTEGER, DIMENSION(1000,100,10) :: A
  INTEGER :: i, j, k

  DO k = 1, 10
    DO j = 1, 100
      DO i = 1, 1000
        A(i,j,k) = i + j + k
      END DO
    END DO
  END DO

  RETURN
END
-------------------------------------------------
Each program performs the same operation using both orders, so viewing a summary of the entire program will not be terribly helpful. To see which of the two functions performs faster, we would be well advised to generate a per-function report using the cg_annotate command, via the following series of commands:

First, compile the program with the -O0 flag to disable optimization, and the -g flag to enable debugging information:


> pgcc -O0 -g -o major major.c

Then, run the program using Valgrind's Cachegrind tool:

> /u2/wes/PET_HOME/bin/valgrind --tool=cachegrind ./major

This will display the program summary in addition to writing a file named cachegrind.out.####, where #### is the process ID. Use this file as a parameter to the cg_annotate program:

> /u2/wes/PET_HOME/bin/cg_annotate cachegrind.out.7902
...
--------------------------------------------------------------------------------
        Ir I1mr I2mr        Dr D1mr D2mr        Dw    D1mw    D2mw  file:function
--------------------------------------------------------------------------------
18,807,008    2    2 7,303,003    0    0 1,101,002  62,501  62,501  major.c:rowMajor
18,008,088    2    2 7,003,033    0    0 1,001,012 840,985 623,193  major.c:columnMajor

According to this function profile, there is no doubt that C is a row-major language. Its row-ordered function produces a mere 62,501 level 1 data cache write misses compared to its column-ordered function's 840,985.

How does the Fortran 90 equivalent of this program fare?


> pgf90 -O0 -g -o major major.f90
> /u2/wes/PET_HOME/bin/valgrind --tool=cachegrind ./major
> /u2/wes/PET_HOME/bin/cg_annotate cachegrind.out.11321
...
--------------------------------------------------------------------------------
        Ir I1mr I2mr        Dr  D1mr D2mr        Dw   D1mw   D2mw  file:function
--------------------------------------------------------------------------------
16,706,007    2    2 8,303,002     0    0 1,202,003 92,258 62,999  major.f90:rowmajor_
15,007,067    3    3 8,003,032     0    0 1,002,023 62,500 62,500  major.f90:columnmajor_
...

Notice how comparable Fortran 90's column-major D1 write misses (62,500) are to C's row-major D1 write misses (62,501). Also worth noting is that the penalty for traversing the array in the wrong order appears to be far more severe in C (840,985 D1 write misses) than in Fortran 90 (92,258).
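
Incidentally, roughly 62,500 write misses is just what a fully sequential traversal should produce. Assuming the simulated data cache uses 64-byte lines (a common size, and consistent with these counts), each line holds 16 four-byte integers, so only the first write to each line misses:

  1,000,000 integer writes / 16 integers per cache line = 62,500 write misses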

In reality, your code's functions are likely to be a tad more complex than a tight, triply nested loop. In that case, it might be worthwhile to look at an annotated line-by-line profile of one (or all) of your source files. The following is an example of how this is done.

First, I separated the Fortran 90 program used above into three separate files: one containing the main program, one containing the rowMajor() function, and one containing the columnMajor() function. As in the previous examples, the program needs to be compiled with optimization disabled and debugging enabled:


> pgf90 -O0 -g -o major rowmajor.f90 columnmajor.f90 main.f90

Then, the program needs to be run through Cachegrind using the same syntax as previous examples:

> /u2/wes/PET_HOME/bin/valgrind --tool=cachegrind ./major

Finally, cg_annotate is run with the full path of the source code file(s) you would like to analyze. For example:

> /u2/wes/PET_HOME/bin/cg_annotate cachegrind.out.23196 rowmajor.f90
...
--------------------------------------------------------------------------------
-- User-annotated source: /import/home/u1/uaf/user/major/rowmajor.f90
--------------------------------------------------------------------------------
        Ir I1mr I2mr        Dr D1mr D2mr        Dw   D1mw   D2mw 

         .    .    .         .    .    .         .      .      .  MODULE rowmajor
         .    .    .         .    .    .         .      .      .  CONTAINS
         .    .    .         .    .    .         .      .      .  
         .    .    .         .    .    .         .      .      .  SUBROUTINE rowMajor()
         .    .    .         .    .    .         .      .      .  IMPLICIT NONE
         .    .    .         .    .    .         .      .      .    INTEGER, DIMENSION(1000,100,10) :: A
         .    .    .         .    .    .         .      .      .    INTEGER :: i, j, k
         .    .    .         .    .    .         .      .      .  
         2    0    0         0    0    0         2      0      0    DO i = 1,1000 
     2,000    0    0         0    0    0     2,000      0      0      DO j = 1,100
   300,000    0    0         0    0    0   200,000      0      0        DO k = 1,10
12,000,000    1    1 5,000,000    0    0 1,000,000 92,258 63,000          A(i,j,k) = i + j + k
 4,000,000    0    0 3,000,000    0    0         0      0      0        END DO
   400,000    0    0   300,000    0    0         0      0      0      END DO
     4,000    0    0     3,000    0    0         0      0      0    END DO
         .    .    .         .    .    .         .      .      .  
         1    1    1         0    0    0         0      0      0    RETURN
         2    0    0         2    0    0         0      0      0  END
         .    .    .         .    .    .         .      .      .  
         .    .    .         .    .    .         .      .      .  END MODULE rowmajor
...

If you would rather see the annotated source code for all source code files at once, use cg_annotate's --auto=yes option. E.g.,

> /u2/wes/PET_HOME/bin/cg_annotate --auto=yes cachegrind.out.23196

As these examples show, Cachegrind can very easily reveal instruction and data access bottlenecks in your code. I find myself wanting to run Cachegrind on every code I have ever written to put my programming efficiency to the test.

For more information, refer to Valgrind's Cachegrind manual at the following URL:

http://valgrind.org/docs/manual/cg-manual.html

Craig Stephenson, Editor

This issue marks the beginning of Craig Stephenson's newsletter editorship.

Craig has been an ARSC consultant since 2006, having served as a student employee for two years before that. He was born and raised in Fairbanks and earned his B.S. in Computer Science from UAF in 2006. He enjoys reading about history, philosophy, and science; working on random programming projects; and blogging, and he is also an avid bowler. We have also confirmed with former editor Tom Baring, who has known all the HPC Newsletter editors, that Craig is the tallest editor in the history of this periodical.

Quick-Tip Q & A


A:[[ I am writing a script, call it A, that calls another script B.
  [[ I want to maintain both scripts in the same directory but I need
  [[ to be able to call A from anywhere.  The problem I'm having is that
  [[ A needs to be able to refer to the directory it is stored at in
  [[ order to find B.  I tried `pwd` as the path to B but that gives me
  [[ the directory I was in when I called A, not the directory that A
  [[ (and B) are stored in.  Is it possible to do this with a shell
  [[ script?  I guess I could use Perl if Perl has a way of doing it.
  [[ Python?  Help!


#
# Reader Ryan Czerwiec made this straightforward suggestion:
#

If script B is in a constant location, then why not just have its
location hardwired in script A?  Instead of using something like:
    $directory_variable/B.csh
Use:
    /really/long/pathname/B.csh
This is the way I always do it, since I keep all of my scripts in a
common directory, so I always know where they are.

Alternatively, if you add the directory containing A and B to your PATH
variable in your .cshrc (or equivalent) file, A and B will run without
any pathname necessary, provided that there aren't scripts with the same
names in a higher-priority part of $PATH.  This would also require that
you NOT use the -f option at the top of your scripts in the line
#!/bin/csh (or other shell equivalent).  If you need the -f for other
reasons, the hardwired pathname method should work fine.

#
# Chris Petrich, Dale Clark, and Greg Newby pointed out that referencing
# a shell script's $0 variable will disclose the path used to invoke the
# script, whether the script was invoked using a full or relative path.
# Here is Chris Petrich's response:
#

The variable $0 contains the file name of the script complete with the
relative path from your current directory to the directory of that
script. For example, using bash expansion to remove the script's name
you could change to the directory of the script with

cd ${0%/*}

#
# This command will work in ksh and sh, too.
#

#
# Greg Newby and Brad Havel also suggested the following alternative, in
# Greg's words:
#

I'm inferring from the question that "A" is in your $PATH.
If so, all you really need is `which A` to get A's full
location for calling B; use dirname to strip out the filename part.

From the command line:
# ls -l ${HOME}/.bin/A ${HOME}/.bin/B

-rwxr-xr-x  1 newby  staff  0 Feb 12 16:26 /Users/newby/.bin/A
-rwxr-xr-x  1 newby  staff  0 Feb 12 16:26 /Users/newby/.bin/B

# which A
/Users/newby/.bin/A

# dirname `which A`
/Users/newby/.bin

So, within your A script, something like:

# Find A's location:
bbaseloc=`dirname \`which A\``

# Run B from that location:
${bbaseloc}/B

#
# Scott Kajihara offered a sed approach:
#

csh:

set DIRECTORY = `which A | sed 's!^\(/.*\)/[^/]*$!\1!'`

sh:

DIRECTORY=`which A | sed 's!^\(/.*\)/[^/]*$!\1!'`

Making these shell variables environment variables is left as
an exercise to the original submitter.

#
# Andrew Roberts combined $0 and dirname for this solution which doesn't
# depend on A being on the $PATH:
#

`dirname $0`/B
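
For readers who would like to see this in context, below is a minimal sketch of a complete script A built around that one-liner. The variable name a_dir is made up purely for illustration, and the sketch assumes $0 holds the path (full or relative) that was used to invoke A, as described above:

#!/bin/sh
# A (sketch): run B from the same directory this script lives in.
a_dir=`dirname "$0"`    # directory component of the path used to invoke A
"$a_dir"/B              # call B from that directory

Quoting "$0" and "$a_dir" simply keeps the sketch working if the path happens to contain spaces.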

#
# Brad Havel discovered that this functionality is present in a Perl
# module as well:
#

If nobody has brought it up yet, Perl has a module that performs the
same functions, probably with better performance than shelling out from
whatever script is being used.

use File::Basename;

($name,$path,$suffix) = fileparse($fullname,@suffixlist);
$name = fileparse($fullname,@suffixlist);

$basename = basename($fullname,@suffixlist);
$dirname  = dirname($fullname);

The module looks to be part of the standard distribution as of
Perl 5.8.7 for sure, but your mileage may vary as to whether it
is present in other versions...

More information:

http://search.cpan.org/~nwclark/perl-5.8.9/lib/File/Basename.pm


#
# And finally, one more Perl call from the editors:
#

use FindBin;                      # FindBin locates the directory of the running script
use lib "$FindBin::Bin/../lib";   # e.g., add a lib directory next to the script to @INC
$bindir = "$FindBin::Bin";        # the script's own directory


Q: I saved an important file in the local /scratch directory on one of
   20+ Linux workstations, but I don't remember which one.  The file
   name is "coastline.inp", and it may or may not be in a subdirectory.
   Since the /scratch directory is not shared between the workstations,
   I need to find the specific machine that has this file.  With so many
   workstations available, what's the most efficient way to determine
   which workstation has the file I need?
   

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives: Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.