## Workstation or Supercomputer?

Now that ARSC is moving to become a "cost recovery center", maybe it's time to look at the economics of using workstations versus supercomputers. I have five overhead slides from John Larson's time at CSRD (the Center for Supercomputing Research and Development at the University of Illinois), entitled "Which machine do I use?", that describe the situation well. The slides give a procedure for deciding between using a dedicated workstation and sharing a traditional supercomputer. Of course, if one of these machines is free to use, the decision should be easy! I'm going to simplify these slides into the ASCII format of this newsletter.

Let's suppose I have two options for solving my computer problems:

```
Option #1  Dedicated Workstation     Relative Performance = 1
Option #2  Timeshared Supercomputer  Relative Performance = 10
```
Each of us can imagine the workstation and supercomputer that is most applicable to our own situation. On either machine we are interested in how long we as users have to wait for results. How long we wait for results will determine how many results we produce. For each machine we have:
```
Time(turnaround) = Time(computing) + Time(waiting)
```
and, as a general time breakdown we have:
```

<-----------------Time(turnaround)------------->

0--------------------------------------------------job finish

<--Time(waiting)-->
                   <--Time(computing)--------->

```
On the Dedicated Workstation we have:
```
Time(turnaround) = Time(computing)
Time(waiting) = 0
```
and, as the Dedicated Workstation time breakdown we have:
```

<-----------------Time(turnaround)------------->

0--------------------------------------------------job finish

<-----------------Time(computing)-------------->

```
In this case "dedicated" means it works on our problem alone; we don't share the Dedicated Workstation with anyone. The Expansion Factor is the ratio of how long the job appears to take (its turnaround time) to the time it actually spends computing.
```
Expansion Factor = Time(turnaround) / Time(computing)

For a Dedicated Workstation,    Expansion Factor = 1.
```
On the Timeshared Supercomputer we have a more complicated situation:
```
Expansion Factor = Time(turnaround) / Time(computing)

For a Timeshared Supercomputer, Expansion Factor > 1.

Expansion Factor    = (may be 5 to 10)
Time(waiting)       = (say, 9) * Time(computing)
Time(computing)     = f(1/relative performance)
Time(waiting)       = Time(queued_to_run) + Time(swapped_out)
Time(swapped_out)   = f(workload, scheduling)
workload = f(number_of_jobs, resource_requirements_per_job)
```
and, as a Timeshared Supercomputer time breakdown we have:
```

<-----------------Time(turnaround)------------->

0--------------------------------------------------job finish

<---------------Time(waiting)----------->
                                         <---->
                                          Time
                                      (computing)
```
The Expansion Factor is a rough indication of how many users are sharing the same CPU. Under the assumption above, 5 to 10 users are sharing each CPU. For any computer we have:
```
Service = Work / Time(turnaround)
Value = Service / Cost
```
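As a quick numerical check of the model so far, here is a minimal sketch that plugs in the two options from the start of this article. The baseline of 100 time units, the waiting multiplier of 9 (the "say, 9" above) and the function names are my own illustrative assumptions, not part of the original slides.

```python
def turnaround(base_time, relative_performance, wait_multiplier):
    """Return (computing, waiting, turnaround) for one job.

    base_time is the computing time on a machine of relative performance 1.
    wait_multiplier is the assumed ratio Time(waiting) / Time(computing).
    """
    computing = base_time / relative_performance
    waiting = wait_multiplier * computing
    return computing, waiting, computing + waiting


def expansion_factor(computing, total):
    """Expansion Factor = Time(turnaround) / Time(computing)."""
    return total / computing


BASE_TIME = 100.0  # assumed: time units needed on the dedicated workstation

ws = turnaround(BASE_TIME, relative_performance=1, wait_multiplier=0)
sc = turnaround(BASE_TIME, relative_performance=10, wait_multiplier=9)

for name, (computing, waiting, total) in (("workstation", ws),
                                          ("supercomputer", sc)):
    print(f"{name:13s} computing={computing:6.1f} waiting={waiting:6.1f} "
          f"turnaround={total:6.1f} "
          f"expansion={expansion_factor(computing, total):4.1f}")
```

With these particular numbers the tenfold speed advantage is cancelled exactly by the waiting time: both machines deliver the same turnaround, so Service is the same and Value depends only on Cost.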
Just from this specification of the problem, we have these observations:
1. All other things being equal, if the turnaround times of two machines are the same, choose the cheaper machine.
2. The cost paid for the Dedicated Workstation goes completely toward computing.
3. The cost paid for a Timeshared Supercomputer goes partly toward computing and partly to pay to wait.

Now that we've done this analysis and we're still determined to use the timeshared supercomputer, our next question is:
```
How can I get more Value from the Timeshared Supercomputer?
```
Using the model above we have:
```
Time(turnaround)    = Time(computing) + Time(waiting)
Time(waiting)       = (9) * Time(computing)
Time(computing)     = f(1/relative performance)
Time(waiting)       = Time(queued_to_run) + Time(swapped_out)
Time(swapped_out)   = f(workload, scheduling)
workload = f(number_of_jobs, resource_requirements_per_job)

Service = Work / Time(turnaround)
Value = Service / Cost
```
This sequence distills our options:
```
If my Cost is fixed (and nonzero), I must increase Service.
If my Work is fixed, I must decrease Time(turnaround).
If my Time(computing) is fixed, I must decrease Time(waiting).
```
So to get better value from my timeshared supercomputer the conclusion is:
```
To decrease Time(waiting), the workload must be decreased.
```
But the workload is controlled by the site administration! Usually the site administration is dealing with hundreds of users, each with only a small amount of influence. So the user who chooses a timeshared supercomputer is almost helpless when it comes to getting his work done. How did this happen?

The core of the problem lies in a difference in expectations:

```
The Timeshared Supercomputer salesman said that he sold me
Time(computing), when what I wanted to buy was Time(turnaround).

The salesman forgot to tell me how much Time(waiting) I was getting
for "free". What do I do now?
```
There is not much that can be done by the user. If a large portion of the user's time is Time(waiting), then not even optimizing his code has much of an effect (a perverse form of Amdahl's law). Another option is to determine the expansion factor for this particular timeshared supercomputer (as approximated by wall-clock time vs. CPU time) and use it when reevaluating the timeshared supercomputer.
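One way to approximate the expansion factor of a running job is to time it with both a wall clock and a CPU clock. The sketch below does this in Python for a single-threaded, CPU-bound job. The helper name `measure_expansion_factor` and the stand-in workload `busy_loop` are hypothetical.

```python
import time


def measure_expansion_factor(work, *args):
    """Run work(*args) and return (wall_seconds, cpu_seconds, ratio).

    ratio = wall / cpu approximates the expansion factor of a
    single-threaded, CPU-bound job while it is running.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    ratio = wall / cpu if cpu > 0 else float("inf")
    return wall, cpu, ratio


def busy_loop(n):
    # Stand-in for a real CPU-bound job.
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    wall, cpu, ratio = measure_expansion_factor(busy_loop, 2_000_000)
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s expansion factor ~ {ratio:.2f}")
```

This only captures the time a job spends swapped out or sharing the CPU once it has started. Time(queued_to_run) has to be added separately, for example from the batch system's submission and start times.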

### What can the Site Administration do?

Most of the options lie with the site administration, and they are not necessarily technical problems but rather policy and implementation choices:
1. Reduce the workload (Time(waiting))
   1. Reduce the number of jobs
      1. limit eligible users
      2. tighten allocation policies
      3. restrict runs or hours used per month
      4. allocation - use it or lose it
   2. Reduce resource requirements of jobs
      1. optimize CPU performance
         1. training of staff and users
         2. use tools - preprocessors, hpm, atexpert
         3. identify and help critical users
         4. use better algorithms and software packages
      2. optimize memory usage
         1. recompute rather than store
         2. recycle variables and workspace
      3. optimize I/O
         1. IOS, SSD, memory-resident datasets
         2. asynchronous I/O
2. Increase relative performance (1/Time(computing))
   1. Increase utilization
   2. Get more powerful machine (speed, bandwidth, memory, I/O)
   3. Realize that on a machine with a current expansion factor of 9, if Time(waiting) remains unchanged, Amdahl's Law limits the Time(turnaround) savings to 11% as Time(computing) goes to 0.

Each of us has a role to play in this game, but it's also good to understand the role of others too.
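The 11% figure in the list above can be reproduced with a few lines of arithmetic. This is a minimal sketch that normalises Time(computing) to 1 and holds Time(waiting) fixed. The function name is my own.

```python
def turnaround_saving(expansion_factor, computing_speedup):
    """Fraction of turnaround saved when only Time(computing) is sped up.

    Turnaround is normalised so that Time(computing) = 1 and
    Time(waiting) = expansion_factor - 1, which is held fixed.
    """
    waiting = expansion_factor - 1.0
    old = 1.0 + waiting
    new = 1.0 / computing_speedup + waiting
    return (old - new) / old


for speedup in (2, 10, 1000):
    print(speedup, round(turnaround_saving(9, speedup), 4))
```

However large the speedup, the saving can only approach 1/9 (about 11%), because the other 8/9 of the turnaround is waiting.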

## The Linpack Benchmark on the T3D (with more than one processor)

Because linpack is a "one DO loop benchmark", optimizing it on a multiprocessor can be easy. We just concentrate on the major loop and then, when it is running well, we use the same optimizations on all the other loops. From last week's newsletter we know that the DO loop nest of interest is:
```
            do 30 j = kp1, n
               t = a(l,j)
               if (l .eq. k) go to 20
                  a(l,j) = a(k,j)
                  a(k,j) = t
   20          continue
c              call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
               do 21 i = 1, n-k
                  a(k+i,j)=t*a(k+i,k)+a(k+i,j)
   21          continue
   30       continue
```
where the call to the saxpy BLAS1 routine has been inlined. Also we have chosen the distribution of the two dimensional matrix for the 100x100 and 1000x1000 problems as:
```
100x100 problem:                     1000x1000 problem:

      parameter( lda = 128 )               parameter( lda = 1024 )
      parameter( n = 100 )                 parameter( n = 1000 )
      real a( lda, lda )                   real a( lda, lda )
cdir$ shared a( :, :block(1) )       cdir$ shared a( :, :block(1) )
```
This choice of declarations was made because:
1. Shared arrays must follow some power-of-2 restrictions (for Craft Fortran).
2. Distribution by column preserves column access that is essential for a cache based processor (the T3D uses the DEC Alpha).
3. Cyclic distribution of the columns provides natural load balancing during the factorization (Experience).

Initially, we only want to study the effects of parallelizing the main DO loop nest. In Craft Fortran, we can use the master/endmaster directives to restrict execution to PE0 and we can use the barrier call to synchronize all processors at the beginning of the shared loop. With these constructs, the general outline of control flow is:
```
      program main
c declarations
cdir$ master
      .                 ! start out uniprocessing
      .
      .
cdir$ endmaster
      call sgefa(       ! call routine that is executed on multiple processors
cdir$ master
      .
      .
      .
cdir$ endmaster
      end

      subroutine sgefa(         ! here's the routine to share
c declarations
      do 60 k = 1, n-1          ! all processors cycle through major loop
cdir$ master
      .                         ! PE0 the master does:
      .                            a. find pivot
      .                            b. form multipliers ...
cdir$ endmaster
      call barrier              ! sync
            do 30 j = kp1, n                              ! shared loop nest
               t = a(l,j)                                 !
               if (l .eq. k) go to 20                     ! exchange pivoted
                  a(l,j) = a(k,j)                         ! rows
                  a(k,j) = t                              !
   20          continue                                   !
c              call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)    ! inlined updates
               do 21 i = 1, n-k                           !
                  a(k+i,j)=t*a(k+i,k)+a(k+i,j)            !
   21          continue                                   !
   30       continue                                      !
   60   continue
      .
      .
      .
      end
```
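For readers without access to a T3D, the master/barrier pattern in the outline above can be imitated with ordinary threads. This is only a sketch under loud assumptions: Python threads stand in for PEs, `threading.Barrier` stands in for `call barrier`, the test `me == 0` stands in for the master/endmaster region, and the "pivot" is a placeholder value rather than a real pivot search.

```python
import threading

N_PES = 4      # assumed number of simulated PEs
N_STEPS = 5    # stands in for the k = 1, n-1 loop

barrier = threading.Barrier(N_PES)
pivot = [None] * N_STEPS              # written by PE0 only
seen = [[] for _ in range(N_PES)]     # what each PE read at each step


def pe(me):
    for k in range(N_STEPS):
        if me == 0:                   # cdir$ master ... cdir$ endmaster
            pivot[k] = 10 * k         # placeholder for "find pivot"
        barrier.wait()                # call barrier
        seen[me].append(pivot[k])     # the shared loop nest would go here


threads = [threading.Thread(target=pe, args=(me,)) for me in range(N_PES)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(seen)
```

Every simulated PE sees the value PE0 wrote for step k, because none of them can pass the barrier before PE0 reaches it. Each step writes its own slot of `pivot`, which is why one barrier per step is enough in this sketch.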
Below is a sequence of six versions of this DO loop nest for parallelization on the T3D:
• craft - a simple Craft version from the fpp'ed source
• mod1 - use the home intrinsic to distribute the updates
• mod2 - use the home intrinsic to distribute the updates and row exchanges
• mod3 - use the DO loop indices to distribute the work
• mod4 - use temp array to make local copy of multipliers
• mod5 - call a local version of saxpy

### craft - simple craft version from the fpp'ed source

Using the transformation from fpp, as shown in last week's newsletter, we break the DO loop nest into the row exchanges done on PE0 and the updates which are done as a DO shared loop. The 'doshared' construct of Craft Fortran distributes the DO 31 work among the processors:
```
            do 30 j = kp1, n
               t = a(l,j)
               temp( j ) = t
               if (l .eq. k) go to 20
                  a(l,j) = a(k,j)
                  a(k,j) = t
   20          continue
   30       continue
cdir$ endmaster
         call barrier()
cdir$ doshared( j ) on a( k+i, j )
            do 31 j = k+1, n
c              call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
               do 21 i = 1, n-k
                  a(k+i,j)=temp(j)*a(k+i,k)+a(k+i,j)
   21          continue
   31       continue
   60 continue
```

### mod1 - use the home intrinsic to distribute the updates

The 'home' intrinsic returns the PE number on which the shared array element resides. In the code below, we use this to keep most of the update DO loop 21 computation local to the PE that owns the column being updated. A shared array temp is filled with the multipliers by PE0 and then accessed by the other PEs:
```
            do 30 j = kp1, n
               t = a(l,j)
               temp( j ) = t
               if (l .eq. k) go to 20
                  a(l,j) = a(k,j)
                  a(k,j) = t
   20          continue
   30       continue
cdir$ endmaster
         call barrier()
            do 31 j = k+1, n
               if( home( a( 1, j ) ) .eq. me ) then
c              call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
                  do 21 i = 1, n-k
                     a(k+i,j)=temp(j)*a(k+i,k)+a(k+i,j)
   21             continue
               endif
   31       continue
   60 continue
```
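To see which columns each PE ends up updating under this test, here is a small sketch of the ownership rule. The mapping used by my stand-in `home` function (1-based column j lives on PE (j-1) mod N$PES) is an assumption about the cyclic distribution; the actual Craft mapping may number things differently.

```python
def home(column, n_pes):
    """Owning PE of a 1-based column under a cyclic column distribution.

    Assumes column 1 lives on PE 0; the real Craft mapping may differ.
    """
    return (column - 1) % n_pes


def columns_updated_by(me, k, n, n_pes):
    """Columns j = k+1..n that PE `me` updates under the home() test."""
    return [j for j in range(k + 1, n + 1) if home(j, n_pes) == me]


n, n_pes, k = 10, 4, 3
work = {me: columns_updated_by(me, k, n, n_pes) for me in range(n_pes)}
print(work)
```

Each remaining column is updated by exactly one PE and the columns are spread roughly evenly, but note that every PE still walks the whole j loop and evaluates the test for every column.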

### mod2 - use the home intrinsic to distribute the updates and row exchanges

Moving the control higher in the loop structure distributes more of the work and eliminates the need for the shared temp array:
```
cdir$ endmaster
         call barrier()
            do 30 j = kp1, n
               if( home( a( 1, j ) ) .eq. me ) then
                  t = a(l,j)
                  if (l .eq. k) go to 20
                     a(l,j) = a(k,j)
                     a(k,j) = t
   20             continue
c              call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
                  do 21 i = 1, n-k
                     a(k+i,j)=t*a(k+i,k)+a(k+i,j)
   21             continue
               endif
   30       continue
```

### mod3 - use the DO loop indices to distribute the work

The cyclic column distribution of the array means columns residing on the same processor are exactly N\$PES columns apart. (N\$PES is the number of processors that the program is currently running on.) We can use this information to incorporate the test for locality into the DO loop indices. This way the test for locality is done only once:
```
cdir$ endmaster
         call barrier()
         me0 = home( a( 1, k+1 ) )
         if( me .ge. me0 ) then
            istart = k+1 + ( me - me0 )
         else
            istart = k+1 + ( N$PES - ( me0 - me ) )
         endif
            do 30 j = istart, n, N$PES
               t = a(l,j)
               if (l .eq. k) go to 20
                  a(l,j) = a(k,j)
                  a(k,j) = t
   20          continue
c              call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
               do 21 i = 1, n-k
                  a(k+i,j)=t*a(k+i,k)+a(k+i,j)
   21          continue
   30       continue
```
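The istart arithmetic is easy to get wrong by one, so it is worth checking that the strided loop visits exactly the columns that the home test would have selected. The sketch below does that exhaustively for small cases. As before, my `home` function assumes that 1-based column j lives on PE (j-1) mod N$PES, which may not match the real Craft numbering.

```python
def home(column, n_pes):
    # Assumed cyclic mapping: 1-based column j lives on PE (j - 1) mod n_pes.
    return (column - 1) % n_pes


def istart(me, k, n_pes):
    """First column >= k+1 owned by PE `me`, following the logic above."""
    me0 = home(k + 1, n_pes)
    if me >= me0:
        return k + 1 + (me - me0)
    return k + 1 + (n_pes - (me0 - me))


def strided_columns(me, k, n, n_pes):
    # Columns visited by "do 30 j = istart, n, N$PES".
    return list(range(istart(me, k, n_pes), n + 1, n_pes))


def filtered_columns(me, k, n, n_pes):
    # Columns selected by testing home() on every j = k+1..n.
    return [j for j in range(k + 1, n + 1) if home(j, n_pes) == me]


# Exhaustive check on small cases.
for n_pes in (1, 2, 4, 8):
    for n in range(1, 20):
        for k in range(1, n):
            for me in range(n_pes):
                assert (strided_columns(me, k, n, n_pes)
                        == filtered_columns(me, k, n, n_pes))
print("strided loop matches the home() test")
```

The two branches of the istart computation differ only in whether the PE's first owned column lies in the current group of N$PES columns or wraps around into the next one.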

### mod4 - use temp array to make local copy of multipliers

The multipliers can be copied once to a local array on each PE:
```
cdir$ endmaster
         call barrier()
         me0 = home( a( 1, k+1 ) )
         istart = k+1
         if( me .gt. me0 ) istart = k+1 + ( me - me0 )
         if( me .lt. me0 ) istart = k+1 + ( N$PES - ( me0 - me ) )
         do 29 j = k+1, n
            temp( j ) = a( j, k )
   29    continue
            do 30 j = istart, n, N$PES
               t = a(l,j)
               if (l .eq. k) go to 20
                  a(l,j) = a(k,j)
                  a(k,j) = t
   20          continue
               do 21 i = 1, n-k
                  a(k+i,j)=t*temp(k+i)+a(k+i,j)
c                 a(k+i,j)=t*a(k+i,k)+a(k+i,j)
   21          continue
   30       continue
```

### mod5 - call a local version of saxpy

With all of the operands of DO loop 21 local to a single PE, the DO loop can be replaced with a call to the optimized uniprocessor version of the BLAS1 library routine, saxpy:
```
cdir$ endmaster
         call barrier()
         me0 = home( a( 1, k+1 ) )
         istart = k+1
         if( me .gt. me0 ) istart = k+1 + ( me - me0 )
         if( me .lt. me0 ) istart = k+1 + ( N$PES - ( me0 - me ) )
         do 29 j = k+1, n
            temp( j ) = a( j, k )
   29    continue
            do 30 j = istart, n, N$PES
               t = a(l,j)
               if (l .eq. k) go to 20
                  a(l,j) = a(k,j)
                  a(k,j) = t
   20          continue
c              call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
               call saxpy(n-k,t,temp(k+1),1,a(k+1,j),1)
c              do 21 i = 1, n-k
c                 a(k+i,j)=t*temp(k+i)+a(k+i,j)
c                 a(k+i,j)=t*a(k+i,k)+a(k+i,j)
c  21          continue
   30       continue
```
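For readers less familiar with the BLAS, the sketch below spells out what the saxpy call computes and checks it against the explicit form of DO loop 21. It is a simplified stand-in written for this newsletter (positive increments only, Python lists indexed from 0), not the library routine, and the sample values are made up.

```python
def saxpy(n, alpha, x, incx, y, incy):
    """y := alpha * x + y over n elements, updating y in place.

    Simplified sketch of the BLAS1 operation for positive increments only;
    x and y are Python lists indexed from 0.
    """
    if n <= 0 or alpha == 0.0:
        return
    for i in range(n):
        y[i * incy] += alpha * x[i * incx]


# Column update from DO loop 21: a(k+i,j) = t*temp(k+i) + a(k+i,j), i = 1..n-k
multipliers = [0.5, -1.0, 2.0]   # stands in for temp(k+1:n)
column = [1.0, 1.0, 1.0]         # stands in for a(k+1:n, j)
t = 4.0

expected = [t * m + c for m, c in zip(multipliers, column)]
saxpy(len(column), t, multipliers, 1, column, 1)
assert column == expected
print(column)
```

Both vectors are walked with stride 1 down a single column, which is the access pattern the column distribution was chosen to preserve.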

### Results

Below is a summary of times for this sequence of modifications for only the factorization stage of the linpack problem. For this newsletter, we are only interested in timings for the routine sgefa that we are modifying. To give us some perspective on how we are doing, we add two uniprocessor timings:
```
asis   - the uniprocessor version with no modifications
lapack - the best uniprocessor version from last week's newsletter
```
The two uniprocessor times for the unmodified source and the lapack version show the worst and best uniprocessor times.
```
Times (seconds) for the factorization phase of the linpack problem
(sgefa only):

        problem size      1PE     2PEs     4PEs    8PEs   16PEs   32PEs
        ------------  -------  -------  -------  ------  ------  ------
asis    100x100          .047
  "     1000x1000      60.380
lapack  100x100          .018
  "     1000x1000      10.080
craft   100x100          .102     .205     .125    .086    .087    .087
  "     1000x1000     259.079  193.042  101.329  61.096  55.418  50.743
mod1    100x100          .100     .280     .159    .101    .090    .094
  "     1000x1000     258.962  263.942  136.665  74.586  61.875  54.030
mod2    100x100          .095     .268     .155    .097    .083    .086
  "     1000x1000     256.710  153.879  131.945  73.585  60.827  53.443
mod3    100x100          .090     .269     .150    .092    .080    .079
  "     1000x1000     242.661  263.045  135.714  73.598  58.906  53.013
mod4    100x100          .070     .210     .122    .081    .064    .066
  "     1000x1000      63.377  182.141   91.808  46.891  24.794  14.645
mod5    100x100          .097     .074     .054    .054    .052    .060
  "     1000x1000      28.045   15.679    8.545   5.179   3.976   4.236
```
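Raw times are hard to compare across processor counts, so here is a small sketch that turns the last 1000x1000 row of the table (the version that calls the local saxpy) into speedup and efficiency figures. The times are copied from the table above. The variable and function names are my own.

```python
pes = [1, 2, 4, 8, 16, 32]
# 1000x1000 sgefa times (seconds), copied from the last row of the table.
local_saxpy_times = [28.045, 15.679, 8.545, 5.179, 3.976, 4.236]
lapack_1pe = 10.080   # best uniprocessor time from the same table


def speedups(times):
    """Speedup of each entry relative to the first (1-PE) entry."""
    return [times[0] / t for t in times]


for p, t, s in zip(pes, local_saxpy_times, speedups(local_saxpy_times)):
    efficiency = s / p
    vs_lapack = lapack_1pe / t   # > 1 means faster than the lapack time
    print(f"{p:3d} PEs  speedup={s:5.2f}  efficiency={efficiency:5.2f}  "
          f"vs lapack={vs_lapack:5.2f}")
```

Two things stand out directly from the table: this version only overtakes the best uniprocessor time from 4 PEs onward, and the 32-PE run is slower than the 16-PE run.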
The lapack results on a single processor are hard to beat, but we are not yet done with sgefa; so far we have only optimized the major DO loop of the factorization phase. Also, for the solving phase of the linpack benchmark (the call to sgesl), we have increased the execution time because of the distribution of the array. The times below give some indication of the cost of using an array distributed among processors as opposed to a local array.
```
Times (seconds) for both phases (factorization and solving)
on linpack problem:

                                  1PE                2PEs
                             sgefa    sgesl     sgefa    sgesl
                             -----    -----     -----    -----
asis           100x100        .047    .0015
               1000x1000    60.380    .1784

lapack         100x100        .018    .0006
               1000x1000    10.080    .0556

craft with     100x100        .070    .0017      .459    .0146
distr. array   1000x1000   242.400    .2284   465.900   1.4540
```
In next week's newsletter we'll see how this parallelization for the major DO loop nest influences the rest of the benchmark. Having done a good job on the parallel part we need to reduce the scalar portion to get good overall speedup (an application of Amdahl's law).

## Bug in STRSM

In the last newsletter, I mentioned that there was a bug in the BLAS3 routine, strsm. Casimir Suchyta of CRI wrote in to say that the bug is being fixed:
```
> Ed Anderson wanted me to let you know that the problem with an
> Operand Range Error in STRSM is SPR 103030 and a fix is already
> working its way through the integration process.
```

## A Call for Material

If you have discovered a good technique or information on the T3D and you think it might benefit others, then send it to the email address below and it will be passed on through this newsletter.
Current Editors:

Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669

Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678

Arctic Region Supercomputing Center, University of Alaska Fairbanks, PO Box 756020, Fairbanks AK 99775-6020
Archives:
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.