ARSC T3D Users' Newsletter 86, May 10, 1996
Workstation or Supercomputer?
Now that ARSC is moving to become a "cost recovery center", maybe it's time to look at the economics of using workstations versus supercomputers. I have five overhead slides, entitled "Which machine do I use?", from John Larson's time at CSRD (the Center for Supercomputing Research and Development at the University of Illinois) that describe the situation well. The slides lay out a procedure for deciding between using a dedicated workstation and sharing a traditional supercomputer. Of course, if one of these machines is free to use, the decision should be easy! Below, I have simplified the slides into the ASCII format of this newsletter.
Let's suppose I have two options for solving my computer problems:
  Option #1   Dedicated Workstation      Relative Performance =  1
  Option #2   Timeshared Supercomputer   Relative Performance = 10

Each of us can imagine the workstation and supercomputer that is most applicable to our own situation. On either machine we are interested in how long we as users have to wait for results. How long we wait for results will determine how many results we produce. For each machine we have:
  Time(turnaround) = Time(computing) + Time(waiting)

and, as a general time breakdown, we have:
  |<------------- Time(turnaround) ------------>|
  0                                    job finish
  |<--- Time(waiting) --->|<- Time(computing) ->|

On the Dedicated Workstation we have:
  Time(turnaround) = Time(computing)
  Time(waiting)    = 0

and, as the Dedicated Workstation time breakdown, we have:
  |<------------- Time(turnaround) ------------>|
  0                                    job finish
  |<------------- Time(computing) ------------->|

In this case "dedicated" means it works on our problem alone; we don't share the Dedicated Workstation with anyone. The Expansion Factor is how long the job appears to take with respect to the time actually spent computing:
  Expansion Factor = Time(turnaround) / Time(computing)

  For a Dedicated Workstation, Expansion Factor = 1.

On the Timeshared Supercomputer we have a more complicated situation:
  Expansion Factor = Time(turnaround) / Time(computing)

  For a Timeshared Supercomputer, Expansion Factor > 1.

  Expansion Factor    = (may be 5 to 10)
  Time(waiting)       = (say, 9) * Time(computing)
  Time(computing)     = f(1/relative performance)
  Time(waiting)       = Time(queued_to_run) + Time(swapped_out)
  Time(queued_to_run) = f(workload)
  Time(swapped_out)   = f(workload, scheduling)
  workload            = f(number_of_jobs, resource_requirements_per_job)

and, as a Timeshared Supercomputer time breakdown, we have:
  |<------------- Time(turnaround) ------------>|
  0                                    job finish
  |<--------- Time(waiting) --------->|<------->|
                                  Time(computing)

The Expansion Factor also indicates, roughly, how many users are sharing the same CPU; under the assumption above, 5 to 10 users are sharing it. For any computer we have:
  Service = Work / Time(turnaround)
  Value   = Service / Cost

Just from this specification of the problem, we have these observations:
  * All other things being equal, if the turnaround times of two machines are the same, choose the cheaper machine (see the worked example below).
  * The cost paid for the Dedicated Workstation goes completely toward computing.
  * The cost paid for a Timeshared Supercomputer goes partly toward computing and partly to pay to wait.
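As a worked example, using the assumed relative performance of 10 and an expansion factor of 10 (i.e., Time(waiting) = 9 * Time(computing)); the numbers are only illustrative:

  Dedicated Workstation:     Time(computing)  = 10 hours
                             Time(turnaround) = 10 + 0  = 10 hours

  Timeshared Supercomputer:  Time(computing)  = 10 / 10 =  1 hour
                             Time(waiting)    = 9 * 1   =  9 hours
                             Time(turnaround) = 1 + 9   = 10 hours

The tenfold performance advantage has been eaten entirely by waiting: both machines deliver the same Service, so the cheaper one gives the better Value.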
How can I get more Value from the Timeshared Supercomputer?

Using the model above we have:
  Time(turnaround)    = Time(computing) + Time(waiting)
  Time(waiting)       = (9) * Time(computing)
  Time(computing)     = f(1/relative performance)
  Time(waiting)       = Time(queued_to_run) + Time(swapped_out)
  Time(queued_to_run) = f(workload)
  Time(swapped_out)   = f(workload, scheduling)
  workload            = f(number_of_jobs, resource_requirements_per_job)

  Service = Work / Time(turnaround)
  Value   = Service / Cost

This sequence distills our options:
  If my Cost is fixed (and nonzero), I must increase Service.
  If my Work is fixed,               I must decrease Time(turnaround).
  If my Time(computing) is fixed,    I must decrease Time(waiting).

So to get better Value from my Timeshared Supercomputer, the conclusion is:
  To decrease Time(waiting), the workload must be decreased.

But the workload is controlled by the site administration! The site administration is usually dealing with hundreds of users, and each individual user has only a small amount of influence. So the user who chooses a timeshared supercomputer is nearly helpless when it comes to getting work through. How did this happen?
The core of the problem lies in a difference in expectations:
The Timeshared Supercomputer salesman said that he sold me Time(computing), when what I wanted to buy was Time(turnaround). The salesman forgot to tell me how much Time(waiting) I was getting for "free".

What do I do now?

There is not much the user can do. If a large portion of the user's time is Time(waiting), then not even optimizing the code has much of an effect (a perverse form of Amdahl's Law). One option is to determine the expansion factor for this particular timeshared supercomputer (approximated by the ratio of wall-clock time to CPU time) and use that number when reevaluating the timeshared supercomputer.
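For instance (the numbers here are made up purely for illustration), a job that uses 600 CPU seconds but takes 4,500 wall-clock seconds to come back has

  Expansion Factor ~= Time(wall clock) / Time(CPU) = 4500 / 600 = 7.5

Plugging that measured value back into the Service and Value formulas above gives a much more honest comparison against the dedicated workstation.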
What can the Site Administration do?
Most of the options lie with the site administration, and they are not so much technical problems as policy and implementation choices:
Reduce the workload (Time(waiting))

  Reduce the number of jobs
    * limit eligible users
    * tighten allocation policies
    * restrict runs or hours used per month
    * allocation - use it or lose it

  Reduce resource requirements of jobs

    optimize CPU performance
      * training of staff and users
      * use tools - preprocessors, hpm, atexpert
      * identify and help critical users
      * use better algorithms and software packages

    optimize memory usage
      * recompute rather than store
      * recycle variables and workspace

    optimize I/O
      * IOS, SSD, memory-resident datasets
      * asynchronous I/O
      * use multitasking

Increase relative performance (1/Time(computing))
  * Increase utilization
  * Get a more powerful machine (speed, bandwidth, memory, I/O)
  * Realize that on a machine with a current expansion factor of 9, if Time(waiting) remains unchanged, Amdahl's Law limits the Time(turnaround) savings to 11% as Time(computing) goes to 0 (the arithmetic is spelled out below).
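To see where the 11% figure comes from (a quick check of the arithmetic, using the definitions above), an expansion factor of 9 means:

  Time(turnaround) = 9 * Time(computing)
  Time(waiting)    = 8 * Time(computing)

If Time(computing) is driven to 0 while Time(waiting) stays unchanged, the new turnaround is 8/9 of the old one, so the largest possible saving is 1/9, or about 11%.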
The Linpack Benchmark on the T3D (with more than one processor)
Being a "one DO loop benchmark", optimizing linpack on a multiprocessor can be easy. We just concentrate on the major loop and then when it is running well we use the same optimizations on all other loops. From last week's newsletter we know that the DO loop nest of interest is:do 30 j = kp1, n t = a(l,j) if (l .eq. k) go to 20 a(l,j) = a(k,j) a(k,j) = t 20 continue c call saxpy(nk,t,a(k+1,k),1,a(k+1,j),1) do 21 i = 1, nk a(k+i,j)=t*a(k+i,k)+a(k+i,j) 21 continue 30 continuewhere the call to the saxpy BLAS1 routine has been inlined. Also we have chosen the distribution of the two dimensional matrix for the 100x100 and 1000x1000 problems as:
  100x100 problem:

      parameter( lda = 128 )
      parameter( n = 100 )
      real a( lda, lda )
cdir$ shared a( :, :block(1) )

  1000x1000 problem:

      parameter( lda = 1024 )
      parameter( n = 1000 )
      real a( lda, lda )
cdir$ shared a( :, :block(1) )

This choice of declarations was made because:
  * Shared arrays must follow some power-of-2 restrictions (for Craft Fortran).
  * Distribution by column preserves the column access that is essential for a cache-based processor (the T3D uses the DEC Alpha).
  * Cyclic distribution of the columns provides natural load balancing during the factorization (experience; see the sketch below).
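To see the layout that the last point relies on, here is a minimal sketch (not from the slides or the benchmark; it only assumes the home() intrinsic described below and the shared declarations above, and the output format is purely illustrative) that prints which PE owns each of the first few columns. With the :block(1) distribution the owners should cycle through the PEs one column at a time, which is also what makes columns on the same PE exactly N$PES apart (used in mod3 below); so as the factorization retires leading columns, the remaining work stays spread evenly across the processors:

cdir$ master
c     print the owner of each of the first 8 columns of the shared array a
      do 10 j = 1, 8
         write(*,*) 'column ', j, ' resides on PE ', home( a(1,j) )
   10 continue
cdir$ endmaster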
The overall structure of the benchmark program is then:

      program main
c     declarations
cdir$ master
      .                          ! start out uniprocessing
      .
      .
cdir$ endmaster
      call sgefa(                ! call routine that is executed on multiple processors
cdir$ master
      call sgesl(                ! return to uniprocessing
      .
      .
      .
cdir$ endmaster
      end

      subroutine sgefa(          ! here's the routine to share
c     declarations
      do 60 k = 1, n-1           ! all processors cycle through major loop
cdir$ master
      .                          ! PE0, the master, does:
      .                          !   a. find pivot
      .                          !   b. form multipliers
      ...
cdir$ endmaster
      call barrier               ! sync
      do 30 j = kp1, n                           ! shared loop nest
         t = a(l,j)                              !
         if (l .eq. k) go to 20                  ! exchange pivoted
            a(l,j) = a(k,j)                      ! rows
            a(k,j) = t                           !
   20    continue                                !
c        call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1) ! inlined updates
         do 21 i = 1, n-k                        !
            a(k+i,j)=t*a(k+i,k)+a(k+i,j)         !
   21    continue                                !
   30 continue                                   !
   60 continue
      .
      .
      .
      end

Below is a sequence of six versions of this DO loop nest for parallelization on the T3D:
  * craft - a simple Craft version from the fpp'ed source
  * mod1  - use the home intrinsic to distribute the updates
  * mod2  - use the home intrinsic to distribute the updates and row exchanges
  * mod3  - use the DO loop indices to distribute the work
  * mod4  - use temp array to make local copy of multipliers
  * mod5  - call a local version of saxpy
craft - simple Craft version from the fpp'ed source
Using the transformation from fpp, as shown in last week's newsletter, we break the DO loop nest into the row exchanges, done on PE0, and the updates, which are done as a DO shared loop. The 'doshared' construct of Craft Fortran distributes the DO 31 work among the processors:

      do 30 j = kp1, n
         t = a(l,j)
         temp( j ) = t
         if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20    continue
   30 continue
cdir$ endmaster
      call barrier()
cdir$ doshared( j ) on a( k+i, j )
      do 31 j = k+1, n
c        call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
         do 21 i = 1, n-k
            a(k+i,j)=temp(j)*a(k+i,k)+a(k+i,j)
   21    continue
   31 continue
   60 continue
mod1 - use the home intrinsic to distribute the updates
The 'home' intrinsic returns the PE number on which a shared array element resides. In the code below, we use this to keep most of the update DO loop 21 computation local to the PE that owns the column being updated. A shared array, temp, is filled with the multipliers by PE0 and then accessed by the other PEs:

      do 30 j = kp1, n
         t = a(l,j)
         temp( j ) = t
         if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20    continue
   30 continue
cdir$ endmaster
      call barrier()
      do 31 j = k+1, n
         if( home( a( 1, j ) ) .eq. me ) then
c           call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
            do 21 i = 1, n-k
               a(k+i,j)=temp(j)*a(k+i,k)+a(k+i,j)
   21       continue
         endif
   31 continue
   60 continue
mod2 - use the home intrinsic to distribute the updates and row exchanges
Moving the locality test higher in the loop structure distributes more of the work and eliminates the need for the shared temp array:

cdir$ endmaster
      call barrier()
      do 30 j = kp1, n
         if( home( a( 1, j ) ) .eq. me ) then
            t = a(l,j)
            if (l .eq. k) go to 20
               a(l,j) = a(k,j)
               a(k,j) = t
   20       continue
c           call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
            do 21 i = 1, n-k
               a(k+i,j)=t*a(k+i,k)+a(k+i,j)
   21       continue
         endif
   30 continue
mod3 - use the DO loop indices to distribute the work
The cyclic column distribution of the array means columns residing on the same processor are exactly N$PES columns apart. (N$PES is the number of processors that the program is currently running on.) We can use this information to fold the test for locality into the DO loop indices, so the test is done only once:

cdir$ endmaster
      call barrier()
      me0 = home( a( 1, k+1 ) )
      if( me .ge. me0 ) then
         istart = k+1 + ( me - me0 )
      else
         istart = k+1 + ( N$PES - ( me0 - me ) )
      endif
      do 30 j = istart, n, N$PES
         t = a(l,j)
         if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20    continue
c        call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
         do 21 i = 1, n-k
            a(k+i,j)=t*a(k+i,k)+a(k+i,j)
   21    continue
   30 continue
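As a quick check of the istart computation above (a made-up example, not taken from the benchmark runs): suppose N$PES = 4 and the pivot column k+1 happens to live on PE 2, so me0 = 2. Then

  PE 2:  istart = k+1
  PE 3:  istart = k+1 + (3 - 2)       = k+2
  PE 0:  istart = k+1 + (4 - (2 - 0)) = k+3
  PE 1:  istart = k+1 + (4 - (2 - 1)) = k+4

and each PE strides through the columns by N$PES = 4, so it touches only the columns it owns.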
mod4 - use temp array to make local copy of multipliers
The multipliers can be copied once to a local array on each PE:

cdir$ endmaster
      call barrier()
      me0 = home( a( 1, k+1 ) )
      istart = k+1
      if( me .gt. me0 ) istart = k+1 + ( me - me0 )
      if( me .lt. me0 ) istart = k+1 + ( N$PES - ( me0 - me ) )
      do 29 j = k+1, n
         temp( j ) = a( j, k )
   29 continue
      do 30 j = istart, n, N$PES
         t = a(l,j)
         if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20    continue
         do 21 i = 1, n-k
            a(k+i,j)=t*temp(k+i)+a(k+i,j)
c           a(k+i,j)=t*a(k+i,k)+a(k+i,j)
   21    continue
   30 continue
mod5 - call a local version of saxpy
With all of the operands of DO loop 21 local to a single PE, the DO loop can be replaced with a call to the optimized uniprocessor version of the BLAS1 library routine, saxpy:

cdir$ endmaster
      call barrier()
      me0 = home( a( 1, k+1 ) )
      istart = k+1
      if( me .gt. me0 ) istart = k+1 + ( me - me0 )
      if( me .lt. me0 ) istart = k+1 + ( N$PES - ( me0 - me ) )
      do 29 j = k+1, n
         temp( j ) = a( j, k )
   29 continue
      do 30 j = istart, n, N$PES
         t = a(l,j)
         if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20    continue
c        call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
         call saxpy(n-k,t,temp(k+1),1,a(k+1,j),1)
c        do 21 i = 1, n-k
c           a(k+i,j)=t*temp(k+i)+a(k+i,j)
c           a(k+i,j)=t*a(k+i,k)+a(k+i,j)
c  21    continue
   30 continue
Results
Below is a summary of times for this sequence of modifications, for only the factorization stage of the linpack problem. For this newsletter we are interested only in the timings for the routine sgefa that we are modifying. To give some perspective on how we are doing, we add two uniprocessor timings:

  asis   - the uniprocessor version with no modifications
  lapack - the best uniprocessor version from last week's newsletter

The two uniprocessor times, for the unmodified source and for the lapack version, show the worst and the best uniprocessor times.
Times (seconds) for the factorization phase of the linpack problem (sgefa only):

          problem size      1PE      2PEs     4PEs     8PEs    16PEs    32PEs
          ------------   -------   -------  -------  -------  -------  -------
  asis    100x100           .047
  "       1000x1000       60.380
  lapack  100x100           .018
  "       1000x1000       10.080
  craft   100x100           .102      .205     .125     .086     .087     .087
  "       1000x1000      259.079   193.042  101.329   61.096   55.418   50.743
  mod1    100x100           .100      .280     .159     .101     .090     .094
  "       1000x1000      258.962   263.942  136.665   74.586   61.875   54.030
  mod2    100x100           .095      .268     .155     .097     .083     .086
  "       1000x1000      256.710   153.879  131.945   73.585   60.827   53.443
  craft3  100x100           .090      .269     .150     .092     .080     .079
  "       1000x1000      242.661   263.045  135.714   73.598   58.906   53.013
  craft4  100x100           .070      .210     .122     .081     .064     .066
  "       1000x1000       63.377   182.141   91.808   46.891   24.794   14.645
  craft5  100x100           .097      .074     .054     .054     .052     .060
  "       1000x1000       28.045    15.679    8.545    5.179    3.976    4.236

The lapack results on a single processor are hard to beat, but we are not yet done with sgefa; so far we have only optimized the major DO loop of the factorization phase. Similarly, for the solving phase of the linpack benchmark (the call to sgesl) we have increased the execution time because of the distribution of the array. The times below give some indication of the cost of using an array distributed among processors as opposed to a local array.
Times (seconds) for both phases (factorization and solving) of the linpack problem:

                                  1PE                   2PEs
                             sgefa     sgesl       sgefa     sgesl
                            -------   -------     -------   -------
  asis         100x100        .047     .0015
               1000x1000     60.380    .1784
  lapack       100x100        .018     .0006
               1000x1000     10.080    .0556
  craft with   100x100        .070     .0017        .459     .0146
  distr. array 1000x1000    242.400    .2284     465.900    1.4540

In next week's newsletter we'll see how this parallelization of the major DO loop nest influences the rest of the benchmark. Having done a good job on the parallel part, we need to reduce the scalar portion to get good overall speedup (an application of Amdahl's Law).
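To put rough numbers on that (simple arithmetic on the tables above, not new measurements): in the factorization table, the craft5 line for the 1000x1000 problem drops from 28.045 seconds on 1 PE to 3.976 seconds on 16 PEs, a parallel speedup of about 7, and from 4 PEs onward it already beats the best uniprocessor (lapack) time of 10.080 seconds. Amdahl's Law, however, says that if a fraction s of the total run remains serial, the overall speedup can never exceed 1/s no matter how many PEs work on the rest: a 10% serial fraction, for example, caps the whole benchmark at a speedup of 10.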
Bug in STRSM
In the last newsletter I mentioned that there was a bug in the BLAS3 routine strsm. Casimir Suchyta of CRI mailed this in to say the bug is being fixed:

> Ed Anderson wanted me to let you know that the problem with an
> Operand Range Error in STRSM is SPR 103030 and a fix is already
> working its way through the integration process.
A Call for Material
If you have discovered a good technique or information on the T3D and you think it might benefit others, then send it to the email address below and it will be passed on through this newsletter.

Current Editors:
Ed Kornkven       ARSC HPC Specialist             ph: 907-450-8669
Kate Hedstrom     ARSC Oceanographic Specialist   ph: 907-450-8678

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020

Email Subscriptions:

Subscribe to (or unsubscribe from) the email edition of the ARSC HPC Users' Newsletter.

Back issues of the ASCII email edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.