ARSC T3D Users' Newsletter 96, July 19, 1996
ARSC's MPP Queues
ARSC's T3D has been busy: 59-77% utilized during the last 4 months. Thanks to a cooperative spirit among users, access to the big queues seems to be reasonably equitable.
This seemed, however, like a good time to restate the "gentlemen's agreement" which has helped maintain fair access. Simply put: please don't have more than one job (queued or executing) in an MPP queue at a time.
The following description of this "agreement" can be found on denali by typing:
denali> news t3dbatch
  Queue name    Number   Max jobs    Time limit    When queue
                of PEs   per queue   on queue      is enabled
  -------------------------------------------------------------------------
  m_128pe_5m    128      1           5 minutes     after downtimes
  m_128pe_8h    128      1           8 hours       Friday 6PM to Sunday 4AM
  m_64pe_24h     64      1           24 hours      always
  m_64pe_10m     64      1           10 minutes    always
  m_32pe_24h     32      1           24 hours      always
  m_16pe_24h     16      2           24 hours      always
  m_8pe_24h       8      2           24 hours      always

A request made to these queues will be run as soon as enough PEs are available to satisfy the request.
Please submit/execute, at most, one job per MPP queue at a time. ARSC does not enforce this "queue ethic" (by actively blocking or deleting user jobs), but has found that users are willing to work together to provide fair access to the queues.
As an example of the "queue ethic," if user "jackalope" submitted a job to the m_64pe_24h queue, he would not submit another until the first had run to completion and thus made the queue available to other users. Meanwhile, "jackalope" could have one job in the m_32pe_24h queue, if he wanted, without violating the ethic.
If you feel that the queues are being misused, please let us know and we will work actively to resolve the situation. Contact:
Tom Baring: baring@arsc.edu, 907-474-1899
18:1 Speedup Demonstrated: Free, for Using System Libraries
[ Don Morton, visiting ARSC from Cameron University, provides this article. See his contact information, below. ]

Many of us have come to view the art of T3D programming as one of maximizing parallelism. We concentrate on our speedups, and if they're close to linear we may think we've done all we can to enhance performance. Obviously, we want our code to be as scalable as possible, but the bottom line is that we want greater throughput.
In many codes, the throughput can be improved by enhancing the per-processor performance. This might make our speedups less than perfect, since communication costs and inherently sequential sections of code would take up an increasing portion of the overall processing time, but what's more important -- a perfect speedup curve or greater throughput? Often, if we can enhance the throughput, we can consider solving larger problems.
Scientific codes often contain vector, vector-matrix, and matrix-matrix operations which can be greatly optimized by making use of the Basic Linear Algebra Subprograms (BLAS). Although vector programmers should already be aware of this, the information is also useful on non-vector processors.
As a simple example, let's consider some vector operations. The code which follows begins with three vectors, x, y, and z. They are initialized, then both x and y are scaled by dividing by their largest-magnitude element. Finally, z is obtained by adding x, multiplied by some factor (alpha), and y. The latter is a classic SAXPY operation (scalar alpha x plus y), and looks like
  z <- alpha*x + y

where x, y, z are vectors and alpha is a scalar value. Saxpy operations are typically very efficient, as vector entries are stored in contiguous logical memory locations. Although cache replacement algorithms vary between architectures, in general, when the first entry of a vector is referenced, it, and a block of memory which follows the entry, is copied to cache under the assumption that more entries in the vector will be referenced in the near future. Thus, when the second entry of the vector is referenced it will already be in cache, providing rapid access. The lesson here is that if we can express our code so that contiguous memory locations are referenced in each loop iteration, memory access will be more efficient. Additionally, this allows us to make use of the highly-optimized libraries for even greater performance.
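As an illustrative sketch of this stride effect in isolation (the array size here is arbitrary, and the RANF/SECONDR calls follow the usage in the examples below), the following small program sums a matrix first down its columns (unit stride in Fortran) and then across its rows; on a cache-based PE the unit-stride sweep would generally be the faster of the two.

      program stride
      parameter(n=1024)
      real a(n,n), s1, s2

ccc   fill the matrix with random values
      do j=1,n
         do i=1,n
            a(i,j) = RANF()
         enddo
      enddo

ccc   column-by-column sweep: the inner loop walks down a column,
ccc   touching contiguous memory locations (unit stride in Fortran)
      s1 = 0.0
      t1 = SECONDR()
      do j=1,n
         do i=1,n
            s1 = s1 + a(i,j)
         enddo
      enddo
      t2 = SECONDR()
      print *, 'column-order sum: ', s1, t2 - t1, ' seconds'

ccc   row-by-row sweep: consecutive references are a whole column
ccc   apart in memory, so each one is likely to miss the cache
      s2 = 0.0
      t1 = SECONDR()
      do i=1,n
         do j=1,n
            s2 = s2 + a(i,j)
         enddo
      enddo
      t2 = SECONDR()
      print *, 'row-order sum:    ', s2, t2 - t1, ' seconds'

      stop
      end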
The first section of the following code displays the scaling and saxpy operations discussed above. The second half of the code shows the same operations performed by calling the BLAS routines ISAMAX to find the index of the largest-magnitude element, SSCAL to scale the vectors, and SAXPY to perform the saxpy operation. All of these routines are documented in man pages on denali. When compiled for the T3D with the Fortran 90 compiler, the code required 1.43 seconds to perform these operations without BLAS routines, and 0.25 seconds with BLAS. CRI has put considerable effort into optimizing these routines, so by recognizing operations which can benefit from BLAS, we can simplify our code and exhibit better per-processor performance, which of course leads to better throughput.
      program blastest
      parameter(isize=1000000, alpha=3.14)
      real x(isize), y(isize), z(isize)

ccc   initialize x and y to some random values
      do i=1,isize
         x(i) = RANF()
         y(i) = RANF()
      enddo

ccccccccccccccccc NON-BLAS

      timer1 = SECONDR()

ccc   scale each of these vectors by their largest absolute value
      index = 1
      do i=2,isize
         if ( ABS(x(i)) .gt. ABS(x(index)) ) index = i
      enddo
      scale = x(index)
      do i=1,isize
         x(i) = x(i)/scale
      enddo

      index = 1
      do i=2,isize
         if ( ABS(y(i)) .gt. ABS(y(index)) ) index = i
      enddo
      scale = y(index)
      do i=1,isize
         y(i) = y(i)/scale
      enddo

ccc   saxpy operation
      do i=1,isize
         z(i) = alpha*x(i) + y(i)
      enddo

      timer2 = SECONDR()
      print *, timer2 - timer1, ' seconds for non-BLAS'

ccccccccccccccccc Same operations as above, this time using BLAS routines

ccc   initialize x and y to some random values
      do i=1,isize
         x(i) = RANF()
         y(i) = RANF()
      enddo

      timer1 = SECONDR()

ccc   scale each of these vectors by their largest absolute value
      index = ISAMAX(isize, x, 1)
      scale = 1.0/x(index)
      call SSCAL(isize, scale, x, 1)

      index = ISAMAX(isize, y, 1)
      scale = 1.0/y(index)
      call SSCAL(isize, scale, y, 1)

ccc   saxpy operation
      call SAXPY(isize, alpha, x, 1, y, 1)

      timer2 = SECONDR()
      print *, timer2 - timer1, ' seconds wall time for BLAS'

      stop
      end

Even more improvement can be realized with matrix operations. As another simple example, let's consider matrix-matrix multiplication. We have all learned that two nxn matrices can be multiplied by placing the following statement inside a set of three nested loops, iterating on i, j, and k:
  c(i,j) <- c(i,j) + a(i,k)*b(k,j)

No matter how we arrange the three nested loops (there are six possible ways), we will obtain the same answer. However, the time it takes to get an answer is influenced by the loop ordering.
In the code which follows, after initializing a and b with random values, and c with zeroes, I demonstrate what I call a "naive" approach -- we simply code up our textbook example and get our answer.
In the next multiplication, which I call "structured," I arrange the nested loops so that I have a saxpy operation. In Fortran, which stores matrices in column-major ordering, the innermost loop will iterate down the columns of a and c. Because the columns will map to contiguous logical memory locations, they can be viewed just as the vectors in a saxpy operation. As the innermost loop iterates, b(k,j) is constant, and thus we can think of it as a scalar value.
In the third multiplication, the innermost loop is replaced with the BLAS SAXPY routine.
Finally, in the fourth multiplication, I make use of the BLAS3 (matrix-matrix) routine SGEMM to replace all of my loops.
      program matmult
      parameter(isize=512)
      dimension a(isize,isize), b(isize,isize), c(isize,isize)

ccc   initialize a and b to random values, c to zero
      do j=1,isize
         do i=1,isize
            a(i,j) = RANF()
            b(i,j) = RANF()
            c(i,j) = 0.0
         enddo
      enddo

ccc   Naive matrix multiplication
      timer1 = SECONDR()
      do i=1,isize
         do j=1,isize
            do k=1,isize
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo
      timer2 = SECONDR()
      print *, 'Time for naive matrix multiplication: ',
     &         timer2 - timer1

ccc   Structured matrix multiplication
ccc   initialize c to zero
      do j=1,isize
         do i=1,isize
            c(i,j) = 0.0
         enddo
      enddo

      timer1 = SECONDR()
      do k=1,isize
         do j=1,isize
            do i=1,isize
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo
      timer2 = SECONDR()
      print *, 'Time for structured matrix multiplication: ',
     &         timer2 - timer1

ccc   BLAS1
ccc   initialize c to zero
      do j=1,isize
         do i=1,isize
            c(i,j) = 0.0
         enddo
      enddo

      timer1 = SECONDR()
      do k=1,isize
         do j=1,isize
            call SAXPY(isize, b(k,j), a(1,k), 1, c(1,j), 1)
         enddo
      enddo
      timer2 = SECONDR()
      print *, 'Time for BLAS1 matrix multiplication: ',
     &         timer2 - timer1

ccc   BLAS3
ccc   initialize c to zero
      do j=1,isize
         do i=1,isize
            c(i,j) = 0.0
         enddo
      enddo

      timer1 = SECONDR()
      call SGEMM('N', 'N', isize, isize, isize, 1.0, a, isize,
     &           b, isize, 1.0, c, isize)
      timer2 = SECONDR()
      print *, 'Time for BLAS3 SGEMM matrix multiplication: ',
     &         timer2 - timer1

      stop
      end

Timings for the code are as follows:
  Time for naive matrix multiplication:        46.09
  Time for structured matrix multiplication:   34.33
  Time for BLAS1 matrix multiplication:        11.18
  Time for BLAS3 SGEMM matrix multiplication:   2.52

I have only scratched the surface of linear algebra optimization, but hopefully this has demonstrated the benefits of looking more closely at our operations to see how we might benefit by calling BLAS routines. With a little effort, we might find it possible to use 16 PEs where we once required 64, simply by improving the per-processor performance.
Closing points -- LAPACK routines, for the solution of various linear systems, are also available on denali. If you're worried about portability, both BLAS and LAPACK sources are available from Netlib:
  http://www.netlib.org/

and have been implemented on numerous systems ranging from Linux PC's to supercomputers. In fact, the use of these routines allows us to implement operations in a portable, yet often optimized, manner.
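As a minimal sketch of what an LAPACK call looks like (the 4x4 system and random right-hand side below are purely illustrative), SGESV is the single-precision driver that factors a general matrix and solves Ax = b in one call:

      program lapdemo
      parameter(n=4)
      real a(n,n), b(n)
      integer ipiv(n)

ccc   build a small random system Ax = b (illustrative values only)
      do j=1,n
         do i=1,n
            a(i,j) = RANF()
         enddo
         b(j) = RANF()
      enddo

ccc   SGESV factors A (LU with partial pivoting) and overwrites
ccc   b with the solution x; a nonzero info flags a singular matrix
      call SGESV(n, 1, a, n, ipiv, b, n, info)

      if (info .eq. 0) then
         print *, 'solution: ', (b(i), i=1,n)
      else
         print *, 'SGESV returned info = ', info
      endif

      stop
      end

The same call works wherever LAPACK is installed, which is the portability point made above.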
=====================================================================
Don Morton                                Email: morton@arsc.edu
Visiting Scientist (summer)               Voice: (907) 474-5507
Arctic Region Supercomputing Center       Fax:   (907) 450-8601
University of Alaska
Fairbanks, AK 99775
http://grizzly.cameron.edu/morton.html
=====================================================================
Current Editors:

  Ed Kornkven      ARSC HPC Specialist             ph: 907-450-8669
  Kate Hedstrom    ARSC Oceanographic Specialist   ph: 907-450-8678

  Arctic Region Supercomputing Center
  University of Alaska Fairbanks
  PO Box 756020
  Fairbanks, AK 99775-6020

Email Subscriptions:

  Subscribe to (or unsubscribe from) the email edition of the
  ARSC HPC Users' Newsletter.

Back issues of the ASCII email edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.