ARSC T3D Users' Newsletter 88, May 24, 1996
Current Status of Scalapack
The following article fortuitously appeared in this week's edition of NA digest:
> From: Jack Dongarra <dongarra@cs.utk.edu> > Date: Sat, 18 May 1996 10:06:15 0400 > Subject: ScaLAPACK Software Update > > The ScaLAPACK project is a collaborative effort between: > Oak Ridge National Laboratory Rice University > University of Tennessee University of California, Berkeley > University of California, Los Angeles University of Illinois > > The ScaLAPACK project is made up of 4 components: > dense matrix software (ScaLAPACK) > large sparse eigenvalue software (PARPACK) > sparse direct systems software (CAPSS) > preconditioners for large sparse iterative solvers (PARPRE) > > ScaLAPACK, version 1.2, includes routines for the solution of linear > systems of equations, symmetric positive definite banded linear systems > of equations, condition estimation and iterative refinement, for LU and > Cholesky factorization, matrix inversion, fullrank linear least > squares problems, orthogonal and generalized orthogonal factorizations, > orthogonal transformation routines, reductions to upper Hessenberg, > bidiagonal and tridiagonal form, reduction of a symmetricdefinite > generalized eigenproblem to standard form, the symmetric, generalized > symmetric and the nonsymmetric eigenproblem. > > Software is available in single precision real, double precision real, > single precision complex, and double precision complex. The software has > been written to be portable across a wide range of distributedmemory > environments such as the Cray T3, IBM SP, Intel series, TM CM5, > clusters of workstations, and any system for which PVM or MPI is available. > A draft ScaLAPACK Users' Guide and a comprehensive Installation Guide is > provided, as well as test suites for all ScaLAPACK, PBLAS, and BLACS routines. > > Future releases of ScaLAPACK will include routines for the solution of > general banded linear systems, general and symmetric positive definite > tridiagonal systems, rankdeficient linear least squares problems, > generalized linear least squares problems, and the singular value > decomposition. Also available will be the fullindex PBLAS, which will > have the majority of alignment restrictions removed, as well as the > ability to operate on partial first blocks. The next release of > ScaLAPACK is slated for Fall, 1996. > > PARPACK (Parallel ARPACK) is an extension of the ARPACK software package > used for solving large scale eigenvalue problems on distributed memory > parallel architectures. The message passing layers currently supported > are BLACS and MPI. Serial ARPACK must be retrieved and installed prior to > installing PARPACK. All core ARPACK routines are available in single > precision real, double precision real, single precision complex, and double > precision complex. An extensive set of driver routines are available > for ARPACK and a subset of these are available for parallel computation with > PARPACK. These may be used as templates that are easily modified to construct > a problem specific parallel interface to PARPACK. > > CAPSS is a fully parallel package to solve a sparse linear system of the form > Ax=b on a message passing multiprocessor; the matrix A is assumed to be > symmetric positive definite and associated with a mesh in two or three > dimensions. This version has been tested on the Intel Paragon and makes > possible efficient parallel solution for several right hand side vectors. > > PARPRE is a package of parallel preconditioners for general sparse > matrices. It includes classical point/block relaxation methods, > generalized block SSOR preconditioners (this includes ILU), and domain > decomposition methods (additive and multiplicative Schwarz, Schur > complement). The communication protocol is MPI, and low level > routines from the Petsc library are used, but installing the complete > Petsc library is not necessary. > > Funding for this effort comes in part from DARPA, DOE, NSF, and CRPC. > For more information on the availability of each of these packages and > their documentation, consult the scalapack index on netlib. > The URL is: > http://www.netlib.org/scalapack/index.html > or type: > echo "send index from scalapack" mail netlib@www.netlib.org > > Comments/suggestions may be sent to scalapack@cs.utk.edu. > > Regards, > Jack Dongarra >
BLACS, PBLAS and Scalapack
Using the Netscape search tools, I was able to find one T3D example using scalapack routines at:http://patpwww.epfl.ch/users_guide/home.html
This page is the "User's Guide for the T3D at EPFL". In English, EPFL is the Swiss Federal Institute of Technology  Lausanne. There is a lot of good T3D information at this address, besides this scalapack example.I took the example given by Michel Roche of the Cray Research EPFLPATP support group, put in some timing calls and ran the example for various problem sizes. The program is set up so that the n$pes = nprow * npcol and npcol = nprow, where:
n$pes = the number of processors in the T3D configuration nprow = the number of processors in the x direction npcol = the number of processors in the y directionthis is the same as: n$pes = nprow**2 = npcol**2. I think this corresponds to the CRAFT Fortran declaration of:
CDIR$ SHARED A( :block(n$pes/2), :block(n$pes/2) )On ARSC's 128 PE machine this means the program can be run on only
4 PEs which is a 2 * 2 processor grid 16 PEs which is a 4 * 4 processor grid 64 PEs which is a 8 * 8 processor gridThe program uses three of the PBLAS routines:
PSGETRF to perform an LU factorization on a general dense matrix PSGETRI to invert a general dense matrix, using the above LU factorization PSGEMM for dense matrix multiplicationfor which there are extensive man pages on denali. The example program doesn't work for all problem sizes that I tried; the table below shows the timings for some of the cases that did work. My modified version is listed at the end of this newsletter.
Table 1
Times (seconds) for PBLAS routines on square matrices on ARSC's T3DProblem size PSGETRF PSGETRI PSGEMM 4PEs 16PEs 64PEs 4PEs 16PEs 64PEs 4PEs 16PEs 64PEs 100x100 .12 .19 .08 .03 .02 .01 200x200 .24 .29 .25 .07 .10 .02 400x400 .63 .54 1.01 .27 .67 .14 800x800 2.50 1.25 5.47 1.21 5.11 .86 1000x1000 4.32 1.75 1.60 9.81 2.08 2.28 9.66 1.58 .40 1600x1600 15.09 4.09 35.40 6.84 39.92 5.98 2000x2000 28.44 6.59 4.25 66.58 12.45 6.97 78.01 11.67 2.77 3000x3000 91.90 17.19 213.92 37.99 263.64 38.28 3200x3200 20.29 45.62 45.96 4000x4000 ### 36.55 ### 85.85 ### 90.59 5600x5600 92.91 32.69 223.60 70.03 244.70 55.51 6000x6000 112.20 275.85 301.77 6400x6400 ### ### ### 8000x8000 12800x12800 ### ### ### where ### means ran out of memory, blank means operand range errorLooking over the code, I don't see why some cases fail. I believe these PBLAS routines in libsci haven't been checked out completely. But like Pete Rigsbee of Cray Research said at the Pittsburgh "Meeting on the Optimization of Codes for the CRAY MPP Systems":
"All T3D bugs will be fixed in the T3E".
Onward and Upward!> program ScaLAPACK > c ScaLAPACK example. Author: Michel Roche, Cray Research, PATPEPFL > c This example uses 3 different ScaLAPACK routines: > c PSGETRF to perform an LU factorization on a general dense matrix > c PSGETRI to invert a general dense matrix > c PSGEMM for dense matrix multiplication. > c The matrix is supposed to be a square nxn matrix. > c The processor grid is also supposed to be squared (i.e. nprow = npcol > c so that n$pes = nprow**2 = npcol**2) > c The user has to choose the value of n (matrix size), > c the processor grid size, e.g. on 64 PEs: nprow=npcol=8, > c and the blocking factor size: nb (see man pages for details) > c > c Parameters that the user has to choose. > parameter(n=800, > & nb=16, ! or nb = 4, ! or nb = 64, > & nprow=4, ! or nprow = 2, ! or nprow = 8, > & npcol=4) ! or npcol = 2) ! or npcol = 8) > c Notes: 1. nprow=npcol is required in this example. > c 2. Depending on the size of "n", the code can crash if > c the number of PEs used is too small > c > c Parameters computed for reservation of "work space" > parameter(minb=nb, > & maxb=nb, > & minp=nprow, > & maxd=n, > & length=4*maxb*(((maxd/minb)/minp)*maxb+maxb)*8) > c Note: length has to be changed depending on the ScaLAPACK routine > c used > integer buff00(length/8) > external blacs_gridinit,blacs_gridinfo,numroc > * > c Initialization of the PBLAS internal buffer. This routine > c must be called > call blacs_pinfo( iam, nprocs ) > call initbuff(buff00,length) > c Processor grid initialization.This routine must be called > call blacs_gridinit(ictxt, 'r', nprow, npcol) > c Information on the processor grid. > call blacs_gridinfo( ictxt, nprow, npcol, myrow, mycol ) > > c Computing matrix leading dimension and additional workspace > locpn = numroc(n,nb,myrow,0,nprow) > locqn = numroc(n,nb,mycol,0,npcol) > > lda_x = locpn > lda_y = locqn > > lwork = locpn * nb > liwork = locqn + nb > > call barrier() > if( iam .eq. 0 ) then > write(6,*)'mycol,myrow,lda,length,lwork,liwork n ',mycol,myrow, > & lda_x,lda_y,length,lwork,liwork, n > endif > > c Call to the main routine > call compute(ictxt,n,nb,nprow,npcol,lda_x,lda_y,lwork,liwork) > > call barrier() > if( iam .eq. 0 ) write(6,*)'compute done' > end > > > > > > subroutine compute(ictxt,n,nb,nprow,npcol,lda_x,lda_y, > & lwork,liwork) > c a is the matrix to invert > real a(lda_x,lda_y), work(lwork) > c array b and c are defined for testing purpose: > c b = a, so that c = b*(inv(a)) must be identity. > real b(lda_x,lda_y),c(lda_x,lda_y) > c descriptor vectors for descinit routine > integer desca(8),descb(8),descc(8) > integer ipiv(lda_x+nb), iwork(liwork) > integer info > > external sgen1, psgetrf, psgetri, psgemm, descinit > * > call blacs_pinfo( iam, nprocs ) > call sgen1(ictxt,n,n,nb,nb,a,lda_x,lda_y) ! generate a matrix > call barrier() > if( iam .eq. 0 ) write(6,*)'sgen done ' > c b = a, for testing purpose > > b = a > > c DESCINIT initializes a descriptor vector that provides ScaLAPACK routines > c with the information about the distribution attributes of a > c distributed array. Users are encouraged to use DESCINIT to initialize the > c descriptors as this may shield their code from future > c modifications to the descriptor. > > call descinit(desca(1),n,n,nb,nb,0,0,ictxt,lda_x,info) > call descinit(descb(1),n,n,nb,nb,0,0,ictxt,lda_x,info) > call descinit(descc(1),n,n,nb,nb,0,0,ictxt,lda_x,info) > > c call to PSGTRF: LU factorization > t1 = second() > call psgetrf(n,n,a,1,1,desca,ipiv,info) > write(6,*)'psgetrf done ', iam, info, second()  t1 > > c call to PSGTRI: Matrix inversion > t1 = second() > call psgetri(n,a,1,1,desca,ipiv,work,lwork,iwork,liwork,info) > call barrier() > if( iam .eq. 0 ) write(6,*)'psgetri done ',info,second()  t1 > > cdir$ barrier > c Compute a*b = c > t1 = second() > call psgemm('n','n',n,n,n,1.,a,1,1,desca(1),b,1,1,descb(1), > & 0.,c,1,1,descc(1)) > call barrier() > if( iam .eq. 0 ) write(6,*)'psgemm done ', second( )  t1 > > c Check that c = identity matrix > call test(ictxt,n, n, nb, nb, c, lda_x,lda_y) > end > > > > subroutine sgen1(ictxt,m, n, nb1, nb2, a, lda_x,lda_y) > > c m : number of rows in the global matrix > c n : number of cols in the global matrix > c nb1 : first dimension of the block in the blockcyclic distribution > c nb2 : second dimension of the block in the blockcyclic distribution > c a : distributed array > c lda_x,lda_y : leading dimensions of the LOCAL part of the distributed array. > > c > real a(lda_x,lda_y) > > C > > call blacs_gridinfo( ictxt, nprow, npcol, myrow, mycol ) > > do i =1,lda_x > do j =1,lda_y > a(i,j) = 1./(myrow*lda_x+i+mycol*lda_y+j1) > enddo > enddo > > if(mycol.eq.myrow) then > do i =1,lda_x > a(i,i) = i > enddo > endif > > return > end > > > > subroutine test(ictx,m, n, nb1, nb2, a, lda_x,lda_y) > > c m : number of rows in the global matrix > c n : number of cols in the global matrix > c nb1 : first dimension of the block in the blockcyclic distribution > c nb2 : second dimension of the block in the blockcyclic distribution > c a : distributed array > c lda_x,lda_y : leading dimensions of the LOCAL part of the distributed array. > > c > real a(lda_x,lda_y) > > C > call blacs_gridinfo( ictx, nprow, npcol, myrow, mycol ) > > ttt = 0. > do i=1,lda_x > do j=1,min(lda_y,i1) > ttt = ttt+abs(a(i,j)) > enddo > enddo > do i=1,lda_x > do j=i+1,lda_y > ttt = ttt+abs(a(i,j)) > enddo > enddo > > if(mycol.eq.myrow) then > do i =1,lda_x > ttt = ttt + abs(a(i,i)  1.) > enddo > endif > > c write(6,*)'test ',mycol,myrow,ttt > > return > end
A Call for Material
If you have discovered a good technique or information on the T3D and you think it might benefit others, then send it to the email address above and it will be passed on through this newsletter.Current Editors:
Email Subscriptions:
Ed Kornkven ARSC HPC Specialist ph: 9074508669 Kate Hedstrom ARSC Oceanographic Specialist ph: 9074508678 Arctic Region Supercomputing Center University of Alaska Fairbanks PO Box 756020 Fairbanks AK 997756020

Subscribe to (or unsubscribe from) the email edition of the
ARSC HPC Users' Newsletter.

Back issues of the ASCII email edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.