ARSC HPC Users' Newsletter 383, March 21, 2008

Iceberg, ARSC IBM P6X System, Retirement in July

During fiscal year 2008, ARSC will begin installing a Cray XT5 as a replacement for Iceberg, the allocated IBM P655+/P690+ system. Due to constraints outside of our control, we will be unable to operate both Iceberg and the new system simultaneously.

Iceberg will remain available until 1 PM AKDT on July 18th, 2008. The contents of $HOME and $WORKDIR on Iceberg will not be saved. If you wish to keep the contents of these directories, please copy the files to $ARCHIVE_HOME prior to July 18th.

Additional reminders will be sent as the Iceberg retirement date nears.

If you have questions about this change, please direct them to consult@arsc.edu.

ARSC XT5, "Official" Press Release

If you'd like to read our official press release on the Cray XT5 system, it is here:

http://www.arsc.edu/news/pingo.html

Regular Expressions with Python Flavor

[By: Anton Kulchitsky]

Python programmers often avoid using regular expressions (regexps), and this is a pity because regexps can simplify many difficult programming tasks. This article describes a couple of "pythonish" tricks which make it easier to work with regular expressions.

To use regular expressions, one needs to import the "re" module:

import re

This article assumes that you know the basic syntax of Python and can write programs in it. It also assumes that you are familiar with regular expressions. The tricks described here are my own; I have not seen them on the internet or in books. As usual with regexps, the ideas may take a little effort to absorb, so fasten your seat belts. :)

  1. Raw Strings

    One reason Python programmers may avoid regular expressions is that Python strings interpret backslashes. For example, 'Example\tline\n' represents the words "Example" and "line" separated by a tab and ended by a newline character. To write this string with literal backslashes in Python, one needs to double them: 'Example\\tline\\n'. This looks ugly, especially when you write regular expressions containing many backslashes.

    To alleviate this problem, Python provides raw strings. Backslashes are not interpreted in such strings.

    Raw strings start with 'r'. For example, r'Example\tline\n' means exactly what is written; no backslash sequences are replaced by special characters. Thus, it is a good idea to use raw strings whenever you write regular expressions.

    It is worth noting that standard Python string formatting works with raw strings just as with regular strings. For example:

    
      r'Example %d' % (1)
    
    is equivalent to
    
      r'Example 1'
    
    Read the Python documentation for more information about raw strings.
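    To see the equivalence concretely, here is a minimal sketch (the variable names are mine, not from the article):

```python
import re

# A plain string needs doubled backslashes to spell one literal backslash:
plain = 'Example\\tline\\n'

# The raw-string form spells the same characters without the doubling:
raw = r'Example\tline\n'

assert plain == raw   # identical character sequences

# A regexp written as a raw string works normally with the re module;
# here r'\\t' matches a literal backslash followed by 't':
assert re.search(r'\\t', raw) is not None
```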
  2. Compilation of regular expressions and re.VERBOSE

    Another possible reason why *any* programmer may avoid using regular expressions is that, once written, they can be very hard to understand. Indeed, what is this?

    
      r'(\d{3})\D*([0-9]{3})\D*(\d{4})\D*(\d{4})\D*(\d*)$'
    
    And why does it parse telephone numbers?

    With some experience you will read this pretty quickly. But how about something more complicated? Even a Perl hacker who has written thousands of regexps can be stuck for an hour parsing a 3-line regexp he or she wrote a week ago. Like Perl, however, Python supports multiline regexps, which are much easier to read and may be documented in-line.

    In Python, write the regexp as a multiline raw string and compile it with the "re.VERBOSE" flag. re.VERBOSE makes the following changes:

    1. whitespace in the regexp that is not inside a character class is ignored.
    2. comments can be placed in regular expressions, starting with the '#' sign, like ordinary Python comments.

    It is also good practice to compile regular expressions. I suggest you always compile them, even when you do not use re.VERBOSE.

    Again, check the Python docs for more information.
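    For instance, in this small example of my own (the date pattern is made up for illustration), the compact and re.VERBOSE spellings compile to equivalent regexps:

```python
import re

# Compact one-line form:
compact = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

# The same pattern under re.VERBOSE: whitespace outside character classes
# is ignored, and '#' starts a comment, so the regexp documents itself.
verbose = re.compile(r'''
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
    ''', re.VERBOSE)

text = 'Newsletter 383, issued 2008-03-21'
assert compact.search(text).groups() == verbose.search(text).groups()
```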

  3. Using "groups" in regexps to get parsed dictionaries of data

    Python expands on Perl's extension syntax. If the first character after the question mark is a "P", then this is an extension specific to Python. Currently there are two such extensions: (?P<name>...) defines a named group, and (?P=name) is a back-reference to a named group.

    Using named groups is similar in concept to setting variables. When a match object is produced, one may use its groupdict() method to obtain a dictionary of all groups, with keys matching the group names. The beauty of this approach will be seen in an example below. Also note the readability of such regular expressions, especially when many groups are used within a regexp.
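    As a small, hypothetical illustration (the patterns and variable names below are mine, not from the article), named groups and back-references work like this:

```python
import re

# (?P<name>...) names a group; groupdict() then maps each name to its match.
date_re = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
m = date_re.search('issued 2008-03-21')
assert m.groupdict() == {'year': '2008', 'month': '03', 'day': '21'}

# (?P=name) is a back-reference: the closing quote below must be the same
# character that opened the string.
quoted_re = re.compile(r'(?P<q>["\']).*?(?P=q)')
assert quoted_re.search('say "hello" now').group() == '"hello"'
```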

  4. Lists and generators of matching objects

    Everything in Python is an object, and any objects can be collected in lists. Recall that re.match or re.search returns either None or a match object. You can use this to build a list of match objects, filter out the None results, and then use list comprehensions to get your results. Using these techniques makes the code very clear.

    Suppose we need to parse a file which contains United States phone numbers and output every number found. We use the following regexp for this (in re.VERBOSE representation):

    
    PHONE_REGEXP = \
                 r'''
                 (?P<area_code>\d{3})  # area code is 3 digits (e.g. '800')
                 \D*                   # optional separator is any number of non-digits
                 (?P<trunk>\d{3})      # trunk is 3 digits
                 \D*                   # an opt. separator
                 (?P<tail>\d{4})       # the rest of number is 4 digits
                 \D*                   # another separator
                 (?P<extension>\d*)    # optional extension
                 '''
    
    Let us compile it:
    
      phone_re = re.compile( PHONE_REGEXP, re.VERBOSE )
    

    Now, let us save the following text in the file, 'phones.txt', as an example:

    
    ----------- phones.txt ------------
    1-800-123-1233
    This is not a number
    Work: (123) 321-1232 # 123
    Home: 1-123-132-0091 call John Smith Jr.
    This is a comment with some numbers: 4343 4342 2234
    ---------- end of file ------------
    
    Now consider the following list:
    
      phones_re_lst = [ y for y in [ phone_re.search(x) for x in open('phones.txt') ] if y ]
    
    This list will contain all match objects from lines of 'phones.txt' that contain a telephone number. It can be written even more elegantly using generator expressions. Indeed, we do not need to produce all these objects explicitly and may generate them lazily, only when we need them. Thus, we may write:
    
      phones_re_gen = ( y for y in ( phone_re.search(x) for x in open( 'phones.txt') ) if y )
    
    Now, let us demonstrate the power and beauty of grouping with names.

    Suppose we would like to print all the parsed telephone numbers in a consistent format, "(xxx) xxx-xxxx # xxx". Nothing could be simpler, as demonstrated by the following lines of code:

    
    for tel in phones_re_gen:
        print '(%(area_code)s) %(trunk)s-%(tail)s # %(extension)s' % tel.groupdict()
    
    When we run this, we get the following:
    
    (800) 123-1233 # 
    (123) 321-1232 # 123
    (123) 132-0091 # 
    
    (HINT: use the search method of the compiled regular expression object if matching only at the beginning of the string is not what you want. It is usually much more efficient to use search for scanning than match with something like '.*' at the beginning of the regexp. This is because '.*' is much harder to optimize than a stricter condition.)
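    To illustrate the hint, here is a minimal sketch with a made-up pattern comparing match, search, and the '.*' workaround:

```python
import re

digits = re.compile(r'\d+')

# match() only succeeds at the very beginning of the string...
assert digits.match('extension 123') is None

# ...while search() scans forward until the pattern is found.
assert digits.search('extension 123').group() == '123'

# Prefixing '.*?' makes match() behave like search(), but the engine has a
# much harder time optimizing it:
assert re.match(r'.*?(\d+)', 'extension 123').group(1) == '123'
```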
  5. Writing utilities that work with input streams

    It is not difficult to write Python programs that work with input streams, so one can write utilities that behave like Unix utilities written in Perl or C. You just need to read from sys.stdin instead of opening a file (see the example below).

    Let us write a complete utility that will get a stream with phone numbers, as in the 'phones.txt' file, and produce the formatted output. Below is the code of the phoneparser.py program:

    
    #!/usr/bin/env python
    
    #
    # 'phoneparser.py' parses the phones from the input stream
    #
    
    import re, sys
    
    # a regular expression that parses phone numbers
    PHONE_REGEXP = \
                 r'''
                 (?P<area_code>\d{3})  # area code is 3 digits (e.g. '800')
                 \D*                   # optional separator is any number of non-digits
                 (?P<trunk>\d{3})      # trunk is 3 digits
                 \D*                   # an opt. separator
                 (?P<tail>\d{4})       # the rest of number is 4 digits
                 \D*                   # another separator
                 (?P<extension>\d*)    # optional extension
                 '''
    phone_re = re.compile( PHONE_REGEXP, re.VERBOSE )
    
    # a generator expression that generates all matching objects from
    # input stream
    phones_re_gen = ( y for y in ( phone_re.search(x) for x in sys.stdin ) if y )
     
    # printing everything in the desired format, using dictionary substitution
    for tel in phones_re_gen:
        print '(%(area_code)s) %(trunk)s-%(tail)s # %(extension)s' % tel.groupdict()
    

    Change the permission of the file to make it executable:

    
       chmod +x ./phoneparser.py
    
    Execute the utility:
    
       cat phones.txt | ./phoneparser.py
    
    And we get the same output as above:
    
    (800) 123-1233 # 
    (123) 321-1232 # 123
    (123) 132-0091 # 
    

Open Source Graphics Applications

[By: Kate Hedstrom]

All images displayed by a computer are eventually drawn pixel by pixel, whether on a 600 dots-per-inch (dpi) printer or on a 90 pixels-per-inch monitor. However, there are two main ways in which images can be stored: as pixels (a raster image) or as vectors (vector graphics). The preferred format for any given image depends on what sort of image it is. A thorough description can be found here:


http://en.wikipedia.org/wiki/Graphics_file_format

There are two primary open source tools corresponding to these representation types.

GIMP

If you have raster images to modify/create, the program you need is GIMP, the GNU Image Manipulation Program. It can be found at:

http://www.gimp.org/

This is the open source answer to Photoshop and is now a very mature program. There are books and tutorials about it, listed at the main GIMP site above.

Inkscape

If you prefer vector graphics, there is now Inkscape, the open source answer to CorelDraw and Illustrator:

http://www.inkscape.org/

It too has tutorials and at least one book.

Which To Use?

It depends on what you are doing!

If it's photos, GIMP is the one for you. For drawing new graphics, I prefer Inkscape. One difference is that since Inkscape treats a line as an editable line object, you can move the ends around or adjust the Bezier control points on curved lines. Once a line is drawn in GIMP, it is transformed to pixels and is no longer a line as such.

Features of Both

I believe that GIMP and Inkscape share some underlying code. Both have the same concept of layers: you can think of your image as a stack in which each layer can be visible or invisible, locked or unlocked, and moved up and down in the stack. Layers were new to me, coming from an xfig background, but I now find them very handy.

One implication of layers is that you need a way to save files that maintains the layer structure. GIMP has its own file format with a .xcf extension; Inkscape's native files are .svg, a vector graphics standard. Both can read and write a variety of formats, and on Linux, Inkscape can even write native GIMP files.

Summary

GIMP is the older of the two, and I've been slowly learning how to use it for years. For drawing where you want to edit the elements, I've used xfig for even longer, so I was very excited to learn about Inkscape, which brings more modern drawing concepts to open source vector drawing.

Mackey wins both the Yukon Quest and Iditarod, Two Years in a Row

I learned Tuesday night (from a restaurant place mat) that Alaska actually has a "state sport." No, it's not kayaking, it's dog mushing.

Everyone's heard of the second most challenging dog-sled race, the 1150 mile "Iditarod Trail Sled Dog Race." The most challenging is often considered to be the "Yukon Quest International Dog Sled Race," which runs 1000 miles, from Fairbanks to Whitehorse, Yukon Territory (and the other direction, in odd years).

Until last year, it was unheard of to win both races in the same year.

Mushers and dogs get only two weeks or so to recover from the first race (which takes about 10 days for the fastest teams), fix the equipment, buy more dog food, and transport everything and everybody to the start of the next race. Most people don't even try it.

By winning these back-to-back races two years in a row, Lance Mackey is becoming something of the Lance Armstrong of dog mushing. He's a cancer survivor. Got the same first name. And seems to be accomplishing the "impossible."

What does this have to do with ARSC? Nothing but geography...

Quick-Tip Q & A


A:[[ I have an executable that uses shared libraries.  Is there a way 
  [[ to show which shared library each symbol is provided by?      


# 
# Here's a solution from Randall Hand.
#


Unfortunately, I'm not aware of any single Unix command capable of
getting this information.  However, it's not very difficult to throw
together a shell script connecting a few other utilities to give you
something like what you want.  A shell script like the following:

        #!/bin/sh
        LOOKUP=/tmp/lookup-$PPID
        rm -f $LOOKUP

        for LIBRARY in $@; do
            echo -ne "\rScanning $LIBRARY...               "
            nm --defined --print-file-name $LIBRARY 2> /dev/null >> $LOOKUP
        done
        echo -e "\nAnalyzing symbol tables..."
        for SYMBOL in `nm -Du "$1" | cut -b 20-`; do
            echo \-\-\-\-\-\> $SYMBOL
            grep " $SYMBOL\$" $LOOKUP
        done

        rm $LOOKUP

This script accepts several arguments. The first argument is the
file to analyze, and the remaining arguments are libraries to check
for the symbols (or directories containing such libraries).
Running it on Amethyst like so:

        findlib.sh ~/local/lib/ezViz/libezViz.so ~/local/lib/*.so \
            ~/local/lib64/* /usr/lib/*

Dumps out several screens of information such as:

        -----> strrchr
        /usr/lib/libc.a:strrchr.o:0000000000000000 T strrchr
        -----> strstr
        /usr/lib/libc.a:strstr.o:0000000000000000 T strstr
        -----> strtok
        /usr/lib/libc.a:strtok.o:0000000000000000 T strtok
        -----> syslog
        /usr/lib/libc.a:syslog.o:0000000000000684 T syslog
        -----> time
        /usr/lib/libc.a:time.o:0000000000000000 T time
        -----> vsprintf
        /usr/lib/libc.a:iovsprintf.o:0000000000000000 W vsprintf
        -----> vtk_netcdf_nc_get_att_text
        /viz/home/rhand/local/lib/libvtkNetCDF.so:000000000000b620 T vtk_netcdf_nc_get_att_text
        -----> vtk_netcdf_nc_get_var_float
        /viz/home/rhand/local/lib/libvtkNetCDF.so:000000000002b250 T vtk_netcdf_nc_get_var_float
        -----> vtk_netcdf_nc_get_vara_float

Showing first the symbol found in the first file (libezViz.so in this
case), and then all the libraries that define that symbol.

While there's no guarantee that that particular file is the one that
will be used at runtime, you can get closer to the runtime behavior by
passing the directories in your LD_LIBRARY_PATH as the arguments.


#
# Editor's solution
#

This was a bit more complicated than I thought it would be.  I really
thought there would be a command to do this!

Here's the basic process I used to solve this:

  1) Find each unresolved symbol in the executable using nm.
  2) Find each shared library referenced by the executable using ldd.
  3) Search for each unresolved symbol in each shared library using
     objdump on the shared library.

Here's a script which performs these operations:

  
http://people.arsc.edu/~bahls/code/find_shared


It's a bit too long to put in the newsletter.  Here's some sample
output:

  mg56 % find_shared ~/mpi/hello
  Unresolved symbols for '/u1/uaf/bahls/mpi/hello'
   + MPI_Finalize found in /usr/local/pkg/voltairempi/voltairempi-S-1/mpi.pathcc.rsh/lib/shared/libmpich.so.1.0
   + MPI_Get_processor_name found in /usr/local/pkg/voltairempi/voltairempi-S-1/mpi.pathcc.rsh/lib/shared/libmpich.so.1.0
   + MPI_Init found in /usr/local/pkg/voltairempi/voltairempi-S-1/mpi.pathcc.rsh/lib/shared/libmpich.so.1.0
   + MPI_Comm_rank found in /usr/local/pkg/voltairempi/voltairempi-S-1/mpi.pathcc.rsh/lib/shared/libmpich.so.1.0
   + printf found in /lib64/tls/libc.so.6
   + __libc_start_main found in /lib64/tls/libc.so.6
  



Q: Here's one proposed a long time ago by friend and former co-editor,
   Guy Robinson.

   There are some interesting collective nouns in the English language.
   E.g., a "pride" of lions, a "knot" of toads.

   The question is, what collectives would you propose for terms in the
   supercomputing vernacular?  For instance, we want collectives for
   terms such as cores, nodes, infinite loops, bugs, compiler options,
   defunct HPC vendors, ARSC consultants, irrelevant benchmarks, etc.,
   etc.  Where there is a known collective, maybe you can improve on it.
   This should be interesting... maybe even fun.

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.