ARSC HPC Users' Newsletter 217, April 5, 2001
SV1e Upgrade Postponed
Chilkoot's processor upgrade to an SV1e has been temporarily postponed.
Chilkoot users should watch the MOTD ("message of the day" -- that valuable information that appears every time you log on) for an announcement of the updated schedule. The upgrade will likely occur during a normal Wednesday night downtime. We regret this delay, but expect to proceed with the upgrade soon.
[Thanks to Jeff McAllister of ARSC for this contribution.]
Depending on the system, MPI might not provide the best performance or the easiest way to parallelize codes. However, MPI is emerging as just about the only common element between HPC architectures (i.e., it works on both shared- and distributed-memory platforms). As ARSC now has several systems on which to test MPI codes, it seems only natural to seek the common elements between the various implementations.
I set a goal of writing a platform-independent example that could pass arbitrarily large array sections in a way that could be extended to many variables. Basically, I was trying to produce a reusable strategy for a fairly common case. In the past I'd written many "toy" MPI programs and played with lots of examples. I'd even gone through the effort of re-engineering larger scientific codes for MPI parallelism. (One works on the T3E, but it's not truly portable yet. The other still doesn't work correctly in its parallel version.) So I thought, aha! Figuring out which MPI routines will actually work portably and scalably would be valuable knowledge...
I'm not going to say that I'm an MPI authority -- the purpose of this article is merely to communicate what I've observed. Someone with more MPI programming experience, or even a different programming style, may be able to come up with another (and perhaps better) set of MPI commands and strategies that work everywhere.
Testing the Standard-ness of MPI
Just because a program uses MPI doesn't mean it's portable. Just as there are differences between compilers, there are differences between MPI implementations. My theory is that some ways of programming in MPI are "more portable" than others. (High-level languages have the same problem. For instance, there are lots of ways to use Fortran common blocks in completely nonportable ways.) So it makes sense to learn the dangerous parts of MPI and avoid them.
The requirements of my portability project:
- transfer any part of an array between two processors
- this transfer should work regardless of size without changing code
- on a domain decomposed into arbitrarily-sized subsections
Furthermore, this approach should be general enough to work in a wide variety of settings. For example, one application I'm hoping to put this to is transferring the boundaries for a finite-difference solver, though I'd like to be able to transfer the entire array partition in other cases. The goal is to be able to figure out the most generic way to transfer data in MPI so that I can practice code reuse, avoiding re-inventing the wheel for each new application.
Delving into this project further, some suspicions I'd had from earlier work were confirmed:
The most basic subset of MPI subroutines is the best bet for portability: init, comm_rank, comm_size, send, recv, isend, irecv, wait, barrier, and finalize. (wtick and wtime may also belong on this list, though testing how widely timer granularity varies between systems suggests another project...)
probe and waitall, though they would be very nice to have, are unfortunately not on the above list (along with testall, testany, waitany, iprobe, and the more esoteric variations of send/recv). The way these behave seems to vary widely between systems.
Directly sending/receiving array subsections (i.e. mpi_send(A(1:mj,kstart(mype):kend(mype)... ) doesn't work reliably. On some platform/compiler combinations it works; on others it is unstable. The alternative -- which does work everywhere -- is copying through a buffer: pack the data into a contiguous buffer, send it, and copy it into the desired array subsection on the receiving side. An additional step of allocating buffers to the exact count specified in the MPI call also seems to be required for full portability.
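The buffer-copy idea is independent of MPI itself, so it can be sketched in plain C. The following is a minimal sketch, not the newsletter's code: the names (pack_columns, unpack_columns) and the double type are illustrative assumptions, but the copy pattern -- a contiguous buffer allocated to exactly the element count of the message, filled column by column from a column-major array -- is the same one the Fortran program below uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Pack columns kstart..kend of a column-major mj-by-mk array into a
 * contiguous buffer sized to exactly the element count of the message.
 * (Hypothetical names; illustrates the strategy, not the article's API.) */
double *pack_columns(const double *a, int mj, int kstart, int kend, int *count)
{
    *count = mj * (kend - kstart + 1);
    double *buf = malloc((size_t)*count * sizeof *buf);
    int sz = 0;
    for (int k = kstart; k <= kend; k++)
        for (int j = 0; j < mj; j++)
            buf[sz++] = a[j + (size_t)k * mj];  /* column-major index */
    return buf;
}

/* The inverse copy performed on the receiving side. */
void unpack_columns(double *a, int mj, int kstart, int kend, const double *buf)
{
    int sz = 0;
    for (int k = kstart; k <= kend; k++)
        for (int j = 0; j < mj; j++)
            a[j + (size_t)k * mj] = buf[sz++];
}
```

On the sending side you would pack, hand the buffer to the send call, and free it after the wait completes; the receiver allocates its own buffer from the size information in the short header message, receives into it, and unpacks.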
After testing many approaches to fulfill my requirements on ARSC's Cray T3E & SV1, IBM SP, Linux cluster, and SGI Origin 3800, I arrived at the following code. Here's a quick summary of how it works:
A "partition table" is created to drive domain decomposition based on an arbitrary array size and an arbitrary number of processors. It's crude: the number of processors can't be greater than the number of columns, and load balancing isn't considered.
Each processor iterates over its own section of the array in the "computational kernel".
Depending on where you end the outer loop, either (a) all computations happen on each processor's subsection, with one transfer at the end, or (b) the entire array subsection is transferred from each processor on every iteration. The result should be the same, but the second option is a stress test for the message-passing strategy.
Message passing is accomplished by a two-message scheme. The first message is short, containing information about what to do with the next (longer) message. This depends on the standard's guarantee that messages from a particular source arrive in order, which seems to be preserved on the systems listed, even under stress tests with tens of thousands of loops.
Data is received into a temporary buffer allocated at exactly the right size for the message, copied into the target array, and the buffer is then deallocated.
      program gridMPI
      implicit none
      include 'mpif.h'
      integer mj,mk
      real,dimension(:,:),allocatable:: A
      real,dimension(:),allocatable::sendbuffer,recvbuffer
      real x,y,z,minv,rad,inc,iso
      integer status(MPI_STATUS_SIZE) ! some MPI implementations don't deal
                                      ! well with MPI_STATUS_IGNORE
      integer I,K,J,L,numloops,error
      integer mype,totpes,ierr,master,tag,source,sz,tmp
      integer req1,req2,req3,req4
      integer msg(4)
      integer, DIMENSION(:),ALLOCATABLE ::Kstart,Kend
      character, dimension(:),allocatable::outline

c     setup mpi
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, mype, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, totpes, ierr)

      numloops=100
      master=0
      mj=60
      mk=mj

      ALLOCATE(outline(mj))
      ALLOCATE(Kstart(0:totpes-1))
      ALLOCATE(Kend(0:totpes-1))

c     create a partition table mapping out the subsections
c     to be handled by each processor
      CALL PartitionTable(1,mk,totpes,0,Kstart,Kend)
      print *,"PE ",mype,":",Kstart(mype),Kend(mype)

      ALLOCATE(A(mj,mk))
      A=0

c     a sphere equation solver.  Put here to use up some time and
c     generate an easily verifiable set of values to transfer.
c     Output should reflect the heights of a hemisphere which
c     changes size with each timestep.  Output can be checked
c     visually or vs scalar output (see below to disable MPI
c     execution).  As the sphere is slightly off-center, testing output
c     will show transposition errors or missing communications from
c     other processors.
      rad=1.0
      inc=.3

      call mpi_barrier(MPI_COMM_WORLD,ierr)

      do L=1,numloops
         rad=rad+inc
         if ((rad>real(mj*2)).or.(rad<1.0)) inc=inc*(-1.0)
         do k=kstart(mype),kend(mype)
            do j=1,mj
               x=real(k)
               y=real(j)
               z=(rad**2-(x-mk/2)**2-(y-mk/2)**2)
               if (z>0) then
                  z=sqrt(z)
               else
                  z=0
               end if
               a(j,k)=z/1000
            end do
         end do
      end do

c     all send their section of the array to master
c     set this to .false. to turn off message passing
      if (.true.) then

         ! yes, some of this information is redundant
         ! and could be obtained from status
         msg(1)=mype
         msg(2)=kstart(mype)
         msg(3)=kend(mype)
         msg(4)=mj
         tag=1
         call MPI_ISEND(msg,size(msg),MPI_INTEGER,
     *        master,tag,MPI_COMM_WORLD,req1,ierr)
         if (ierr /= MPI_SUCCESS) then
            print *, "initial msg send unsuccessful on PE ",mype
         end if

         sz=0
         allocate(sendbuffer(mj*(kend(mype)-kstart(mype)+1)),
     *        stat=error)
         if (error /=0) then
            print *,"error allocating sendbuffer in loop ",L
         end if
         do k=kstart(mype),kend(mype)
            do j=1,mj
               sz=sz+1
               sendbuffer(sz)=a(j,k)
            end do
         end do
         tag=999
         call MPI_ISEND(sendbuffer,sz,MPI_real,
     *        master,tag,MPI_COMM_WORLD,req2,ierr)
         if (ierr /= MPI_SUCCESS) then
            print *, "main buffer send unsuccessful on PE ",mype
         end if

         if (mype==master) then
            do i=1,totpes
               tag=1
               call MPI_IRECV(msg,size(msg),MPI_INTEGER,
     *              MPI_ANY_SOURCE,tag,MPI_COMM_WORLD,req3,ierr)
               call MPI_WAIT(req3,status,ierr)
               if (ierr /= MPI_SUCCESS) then
                  print *, "init msg recv unsuccessful from PE ",
     *                 status(MPI_SOURCE)
               end if
               tag=999
               source=msg(1)
               sz=mj*(msg(3)-msg(2)+1)
               allocate(recvbuffer(sz),stat=error)
               if (error /=0) then
                  print *,"error allocating recvbuffer in loop ",L
               end if
               call MPI_IRECV(recvbuffer,sz,MPI_real,
     *              source,tag,MPI_COMM_WORLD,req4,ierr)
               call MPI_WAIT(req4,status,ierr)
               if (ierr /= MPI_SUCCESS) then
                  print *, "main msg recv unsuccessful from PE ",
     *                 status(MPI_SOURCE)
               end if
               sz=0
               do k=msg(2),msg(3)
                  do j=1,msg(4)
                     sz=sz+1
                     a(j,k)=recvbuffer(sz)
                  end do
               end do
               deallocate(recvbuffer,stat=error)
               if (error /=0) then
                  print *,"error deallocating recvbuffer in loop ",L
               end if
            end do
         end if

         call mpi_wait(req1,status,ierr)
         call mpi_wait(req2,status,ierr)
         deallocate(sendbuffer,stat=error)
         if (error /=0) then
            print *,"error deallocating sendbuffer in loop ",L
         end if
         call mpi_barrier(MPI_COMM_WORLD,ierr)
      end if

      ! if you'd like to see a program with a really pathological
      ! comp time/comm time ratio, uncomment the next line and
      ! comment the last line of the "computational kernel";
      ! the results should be the same but execution time will
      ! increase significantly with each new processor
c     end do

      if (mype==master) then
         minv=minval(A)
         iso=(maxval(A)-minv)/9
         do j=1,mj
            do k=1,mk
               if (A(j,k)>0) then
                  outline(k:k)=char(int((A(j,k)-minv)/iso)+48)
               else
                  outline(k:k)=" "
               end if
            end do
            print *,outline
         end do
      end if

      call MPI_FINALIZE(ierr)
      end


      SUBROUTINE PartitionTable(start,MAX,nPEs,overlap,
     *     Pstart,Pend)
      IMPLICIT NONE
C     -------- SCALAR VARIABLES -------
      integer MAX,start         ! partitioned range=start:Max
      integer nPEs,overlap      ! divided over nPEs with specified overlap
C     -------- DIMENSIONED VARIABLES -----
      integer Pstart(0:nPEs-1)  ! the partition start and end vars
      integer Pend(0:nPEs-1)
C     -------- local variables ----------
      integer I
      integer R,B
      integer POS,Size,Tsize

!     get the base partition, B (i.e. how this space would be partitioned
!     if it divided evenly between the PEs) and the remainder, R.
!     R PEs will do 1 extra loop.
      Tsize=(Max-start)+1
      if (Tsize<nPEs) then
         print *,"Cannot partition: Max-start is less than nPEs"
         STOP
      end if
      B=Tsize/nPEs-1
      R=mod(Tsize,nPEs)
      POS=start-1
      DO I=0,nPEs-1
         POS=POS+1                    ! advance to next partition
         Size=B                       ! set partition size
         if (I<R) Size=Size+1         ! 1 extra loop for remainder PEs
         Pstart(I)=POS                ! set Pstart to partition start
         POS=POS+Size
         Pend(I)=POS                  ! set Pend to partition end
         Pstart(I)=Pstart(I)-overlap  ! adjust for overlaps
         if (Pstart(I)<start) Pstart(I)=start
         Pend(I)=Pend(I)+overlap
         if (Pend(I)>MAX) Pend(I)=MAX
      END DO
      RETURN
      END SUBROUTINE
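The arithmetic inside PartitionTable is easy to check in isolation. Here is a small C analogue (an assumption of mine, not part of the newsletter's code; the overlap handling is omitted): divide the inclusive range start..max over npes processors, and give the first (tsize mod npes) processors one extra element, which matches the Fortran routine's results for zero overlap.

```c
#include <assert.h>

/* C analogue of the PartitionTable arithmetic, without overlap.
 * Partitions the inclusive range start..max over npes processors;
 * the first (tsize % npes) processors receive one extra element. */
void partition_table(int start, int max, int npes, int *pstart, int *pend)
{
    int tsize = max - start + 1;  /* total elements to distribute */
    int base  = tsize / npes;     /* minimum elements per PE */
    int rem   = tsize % npes;     /* first `rem` PEs get one extra */
    int pos   = start;
    for (int i = 0; i < npes; i++) {
        int size = base + (i < rem ? 1 : 0);
        pstart[i] = pos;
        pend[i]   = pos + size - 1;  /* inclusive end of this PE's block */
        pos += size;
    }
}
```

For the program's mk=60 columns on 8 PEs, this gives four PEs 8 columns and four PEs 7, with consecutive, non-overlapping blocks covering 1..60.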
MPI can be a "portable" standard, though (as with other standards like programming languages) just because it's written in MPI doesn't make a code portable. Portability can be increased by:
Testing code on a wide variety of systems during development.
Using the simplest subset of MPI possible.
As you can see from the above code, though, the results of attempting to be as general and as portable as possible are not compact, elegant, or optimal in terms of performance. We're back to the old problems of performance being at odds with portability (and good programming practice in general). However, this example (unlike many of the other example codes out there) is actually scalable to "real problem" size on many systems. This is just a beginning. It has often been said that MPI is too hard to use. Hopefully I've in some small way made it easier with this experiment.
[ Editor's note: comments and other experiences of MPI portability are invited. ]
ARSC Training, Next Two Weeks
For details on ARSC's spring 2001 offering of short courses, visit:
Final course of the semester:
Visualization with Vis5D, Part I
Wednesday, April 11, 2001, 2-4pm
Visualization with Vis5D, Part II
Wednesday, April 18, 2001, 2-4pm
Please let us know what topics you'd like us to cover next fall. Also, ARSC staff are generally available for one-on-one consultation. ARSC users may always contact firstname.lastname@example.org if you need help with programming, optimization, parallelization, visualization, or anything else. We'll try to connect you with an appropriate staff member.
Quick-Tip Q & A
A:[[ Is there a way to peek at my NQS job's stdout and stderr files
  [[ (.o and .e files), while the job is still running?
  [[
  [[ I'm in debug mode, and wasting a lot of CPU time because these jobs
  [[ must run to completion before I can see ANY output.  Sometimes, I
  [[ could qdel them based on debugging output emitted early in the run.

  From "man qsub":

    -ro    Writes the standard output file of the request to its
           destination while the request executes.
    -re    <ditto, for stderr>

  Thus, the answer is to add this to your qsub script:

    #QSUB -ro
    #QSUB -re

  Experiments on the T3E and SV1 reveal a minor problem, however. If I
  don't combine stderr and stdout (using "-eo") and do specify "-re",
  the stderr file gets stored to my home directory, rather than the
  normal destination, the directory from which the qsub script is
  submitted. Everything works fine for the stdout file, whether or not
  "-eo" is given.

  Of course, you should only do read operations, like "cat," on these
  output files while they're still being written by the NQS job.

Q: Help! I've got myself stuck on 4 (FOUR!) list servers, and can't
   unsubscribe! It was 3, but last month I joined a support group for
   people addicted to listservers! Ahhhh! That made 4.

   The problem seems to be that I'm subscribed as (making something up
   here):

     "myname@my_old_host.frazzle.com"

   but they've retired "my_old_host" and moved me to a new workstation,
   and now, when I send an "unsubscribe" message, I get a reply that

     "myname@my_new_host.frazzle.com"

   is not subscribed. How can I get my name off these lists?
[[ Answers, Questions, and Tips Graciously Accepted ]]
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Subscribe to (or unsubscribe from) the e-mail edition of the
ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.