## Matrix-Vector Multiplication on the Y-MP and T3D

For an upcoming class, I wanted to illustrate the 'range of performance' that is available on a single machine as a function of optimization effort. Like many computer scientists before me, I milked the example of matrix-vector multiplication. For my timings, I took the problem:

```
Ab = c
```
where A is an n by n matrix and b and c are vectors of n elements.

There is a plethora (nice word) of ways of doing this multiplication; I stopped accumulating them at 10.

### Top Ten Ways of Multiplying a Matrix by a Vector on CRI Hardware

1. sgemm - BLAS3 routine, in libsci
2. sgemv - BLAS2 routine, in libsci
3. mxma - old libsci routine, missing in the T3D version of libsci
4. calls to saxpy - BLAS1 routine, in libsci, outer product formulation
5. calls to sdot - BLAS1 routine, in libsci, inner product formulation
6. calls to Fortran version with a saxpy loop
7. calls to Fortran version with a saxpy loop with one IF statement
8. calls to Fortran version with a saxpy loop with two IF statements
9. calls to Fortran version with a saxpy loop with loop unrolled 4 times
10. calls to Fortran version with a sdot loop

The complete runnable source for the timing program is given below. With this program I timed matrix-vector multiplications of various sizes and produced the tables of MFLOPS for the ARSC Y-MP and a single PE of the T3D for the 37 different problem sizes shown below.

### Table 1

MFLOPS for matrix-vector multiplication methods on the ARSC Y-MP
```
case size sgemm  sgemv   mxma   call    call fsaxpy fsaxpy1 fsaxpy2 fsaxpy3 fsdot
                                saxpy   sdot

  1    0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
  2    1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
  3    2    0.3    0.6    0.5    0.5    0.5    0.7    0.6    0.5    0.6    0.5
  4    3    0.9    1.4    1.3    1.1    1.0    1.6    1.5    1.1    1.4    1.1
  5    4    1.6    2.5    2.3    1.9    1.7    2.7    2.5    1.8    2.6    1.8
  6    5    2.5    4.0    3.6    2.8    2.3    4.2    3.8    2.5    3.9    2.5
  7    6    3.7    5.8    5.2    3.6    3.0    5.8    5.2    3.3    5.5    3.3
  8    7    5.1    7.7    7.0    4.6    3.7    7.6    6.6    4.1    7.2    4.2
  9    8    6.6   10.1    9.0    5.6    4.5    9.5    8.2    4.9    9.6    5.0
 10    9    8.3   12.5   11.3    6.6    3.6   11.5    9.8    5.8   11.8    5.8
 11   10   10.1   15.0   13.5    7.6    4.0   13.5   11.3    6.5   13.6    6.6
 12   16   23.8   34.2   30.2   13.5    6.7   26.2   21.4   11.0   29.9   11.2
 13   20   34.9   48.4   42.4   17.8    7.6   34.8   27.9   13.4   41.9   13.9
 14   30   66.0   86.4   70.2   27.5   11.0   53.8   42.5   18.3   70.6   19.4
 15   32   72.7   93.6   75.6   29.3   11.6   56.6   44.9   19.2   77.9   20.6
 16   40   97.1  119.6   92.5   35.3   12.8   65.2   52.9   21.7   99.2   23.7
 17   50  125.6  148.7  107.3   41.1   15.2   75.5   61.7   24.4  115.5   26.9
 18   60  150.6  172.0  118.6   48.6   17.3   83.0   68.8   26.3  131.2   29.6
 19   63  153.7  174.9  123.0   50.3   17.9   84.9   70.9   26.9  132.3   30.3
 20   64  159.2  179.7  122.7   51.0   18.2   86.4   71.9   27.1  136.5   30.8
 21   65  133.7  145.4  103.8   47.8   17.8   74.2   63.4   21.6  117.4   27.2
 22   70  143.6  152.7  109.4   50.2   18.9   77.2   66.2   22.5  123.2   29.5
 23   80  157.9  170.9  120.6   55.2   21.7   82.7   72.2   24.2  135.1   33.4
 24   90  173.2  185.4  127.1   60.9   23.9   88.2   77.6   25.7  142.6   35.9
 25  100  180.2  192.7  133.0   65.4   25.9   91.5   80.9   27.0  156.1   38.3
 26  128  212.3  221.1  141.6   74.8   31.1   99.4   89.1   29.9  168.0   43.1
 27  200  210.2  210.4  139.5   85.1   39.2   97.6   90.9   28.2  168.8   54.0
 28  256  231.0  234.5  148.0   94.6   46.1  105.4   99.9   31.2  181.5   59.3
 29  300  230.8  231.2  146.8   97.6   47.8  105.9  100.9   30.9  182.1   65.5
 30  400  228.4  230.2  147.0  102.5   53.8  105.6  101.5   30.6  180.9   69.2
 31  500  234.8  235.8  149.7  108.4   60.9  108.3  106.2   31.8  187.2   72.0
 32  512  240.3  242.1  150.6  109.7   61.3  109.2  106.1   31.9  189.7   72.4
 33  600  235.0  236.4  149.6  111.7   64.6  108.8  106.4   31.3  185.1   75.0
 34  700  240.8  240.3  149.9  113.8   65.8  109.8  108.0   32.3  190.6   78.9
 35  800  239.7  238.6  149.9  115.1   67.7  108.9  106.4   31.6  189.2   79.5
 36  900  230.0  230.6  149.0  115.1   69.1  109.3  106.9   31.4  187.5   79.1
 37 1000  236.1  235.5  150.8  118.9   72.8  109.2  108.0   31.9  190.0   80.4
```

### Table 2

MFLOPS for matrix-vector multiplication methods on the T3D (1PE)
```
case size sgemm  sgemv   mxma   call    call fsaxpy fsaxpy1 fsaxpy2 fsaxpy3 fsdot
                                saxpy   sdot

  1    0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
  2    1    0.1    0.1    0.0    0.1    0.1    0.5    0.5    0.5    0.3    0.5
  3    2    0.4    0.7    0.0    1.0    0.9    2.3    2.2    2.0    1.7    2.3
  4    3    1.1    1.8    0.0    2.2    1.9    4.6    4.5    3.9    3.7    5.6
  5    4    1.7    2.6    0.0    3.5    3.1    6.5    6.6    5.4    6.1    6.9
  6    5    1.9    3.3    0.0    4.0    4.2    8.6    8.6    6.9    7.7    8.7
  7    6    2.9    4.8    0.0    5.0    5.3   10.0   10.1    8.2    9.6   11.5
  8    7    4.1    6.9    0.0    6.8    6.5   11.2   11.4    9.3   11.1   13.2
  9    8    3.4    4.5    0.0    7.7    6.1   11.7   12.0   10.1   16.7   15.0
 10    9    3.6    5.1    0.0    7.6    6.8   12.5   13.2   10.6   17.1   15.5
 11   10    4.2    6.1    0.0    8.2    7.9   13.0   13.9   11.3   17.5   17.2
 12   16    9.0   15.2    0.0   10.1   12.2   14.7   15.5   12.7   24.6   18.9
 13   20    9.9   15.1    0.0   11.3   12.0   15.3   16.5   13.4   27.0   20.2
 14   30   12.8   23.9    0.0   13.9   13.1   12.4   12.9   11.0   18.6   15.2
 15   32   15.5   29.9    0.0   15.4   13.1   12.1   12.5   10.7   19.1   14.7
 16   40   15.2   29.2    0.0   16.6   13.1   11.5   11.8    8.2   16.6   13.4
 17   50   16.0   32.0    0.0   17.0   12.2   11.1   11.4    9.7   16.5   11.7
 18   60   17.3   35.0    0.0   18.5   12.8   11.3   11.6    9.8   17.2   12.0
 19   63   16.6   33.1    0.0   18.1   11.4   11.0   11.7    9.8   16.8   11.9
 20   64   17.8   36.4    0.0   14.1    5.2   11.3   11.7    9.8   17.4   11.9
 21   65   16.6   26.4    0.0   14.3    5.2   11.3   11.6    9.8   17.2   11.9
 22   70   16.7   32.9    0.0   14.4    5.2   11.0   11.0    9.8   17.2   11.9
 23   80   18.5   37.2    0.0   15.8    5.3   11.5   11.8    9.3   17.8   12.0
 24   90   17.9   34.6    0.0   16.4    5.4   11.3   11.0    9.6   16.8   11.1
 25  100   17.9   35.7    0.0   17.1    5.4   11.3   11.7    9.7   17.2    9.5
 26  128   17.8   38.5    0.0   19.0    5.7   11.1   11.8    9.8   17.4    7.3
 27  200   18.6   38.3    0.0   21.2    5.9   11.2   11.6    9.6   17.0    6.2
 28  256   18.7   39.4    0.0   22.6    5.9   11.2   11.5    9.6   17.3    6.1
 29  300   18.7   39.3    0.0   23.2    5.9   11.1   11.4    9.5   17.2    6.1
 30  400   19.0   39.2    0.0   24.4    6.0   10.8   11.2    9.3   17.0    5.9
 31  500   18.7   39.0    0.0   24.9    6.1   10.6   10.9    9.2   16.9    5.8
 32  512   18.8   39.4    0.0   25.0    6.1   10.6   10.9    9.1   16.9    5.8
 33  600   18.8   39.2    0.0   25.4    6.1   10.5   10.8    9.0   16.7    5.7
 34  700   18.9   39.4    0.0   25.7    6.1   10.2   10.5    8.8   16.6    5.6
 35  800   18.9   39.5    0.0   26.0    6.1   10.1   10.4    8.7   16.4    5.4
 36  900   18.8   39.3    0.0   26.1    6.1    9.9   10.2    8.5   16.3    5.4
 37 1000   18.9   39.4    0.0   26.3    6.1    9.7   10.0    8.4   16.3    5.4
```

### Graphical Presentation

Both tables contain the results of 3700 timing experiments, and it is almost impossible to extract the trends without graphing the data. To do this, I like to use the GNU tool gnuplot. Below is a typical makefile and plotfile for gnuplot:
```
#makefile
all:	results.t3d.100 t3d.plot
	awk '{ print $$2, $$3 }' results.t3d.100 > sgemm
	awk '{ print $$2, $$4 }' results.t3d.100 > sgemv
	awk '{ print $$2, $$6 }' results.t3d.100 > callsaxpy
	awk '{ print $$2, $$7 }' results.t3d.100 > callsdot
	awk '{ print $$2, $$8 }' results.t3d.100 > fsaxpy
	awk '{ print $$2, $$9 }' results.t3d.100 > fsaxpy1
	awk '{ print $$2, $$10 }' results.t3d.100 > fsaxpy2
	awk '{ print $$2, $$11 }' results.t3d.100 > fsaxpy3
	awk '{ print $$2, $$12 }' results.t3d.100 > fsdot
	gnuplot t3d.plot
	lprt out

#plotfile - t3d.plot
set output 'out'
set term postscript
set title "Matrix Vector Multiplication on the ARSC T3D"
set yzeroaxis
set samples 37
set xlabel "Order of Matrix"
set ylabel "Mflop/s rate"
#set noborder
plot  'sgemm' with linespoints 1 1,   'sgemv'  with linespoints 2 2,    \
'fsaxpy3' with linespoints 3 3, 'callsaxpy' with linespoints 5 5, \
'fsaxpy' with linespoints 6 6,  'fsaxpy1' with linespoints 7 7,   \
'fsaxpy2' with linespoints 8 8, 'callsdot' with linespoints 9 9,  \
'fsdot' with linespoints 10 10
```
From the results of Table 1 and Table 2 it is easy to extract the following observations.

On the Y-MP:

1. The asymptotic speeds for sgemm and sgemv are almost identical.
2. As with all CRI vector machines, the asymptotic speed is almost reached at size 200.
3. The cost of doing a problem of size 65 is substantially larger than one of size 64. (The overhead of partitioning loops into segments of 64 or less is doubled.)
4. The version using the unrolled implementation of saxpy is the fastest of all Fortran implementations.
5. Enough IF statements can drive the performance down to scalar speeds.

On the T3D:

1. Cache performance breaks down for all methods at problem sizes below 100.
2. The winner is sgemv by a wide margin.
3. There is some anomaly for sgemv at size 65. (But the T3D doesn't have vectors!?)
4. All asymptotic speeds are below 40 MFLOPS; all Fortran asymptotic speeds are below 20 MFLOPS.
5. The simplest Fortran implementations of sdot and saxpy have almost identical speeds.

For those who don't want to manipulate the tables above with gnuplot, I can e-mail postscript versions. (This ASCII newsletter has some defining limits.)

## MPI Keeps on Growing

```
> Parallel Programming with MPI
> March 5 & 6, 1996 at OSC
>
> The Ohio Supercomputer Center (OSC) is offering a two-day
> course on using the Message Passing Interface (MPI) standard
> to write parallel programs on several of the OSC MPP systems.
> For more information on MPI, see
> http://www.osc.edu/Lam.html#MPI on the WWW.
>
> MPI topics to be covered include a variety of processor-to-
> processor communication routines, collective operations
> performed by groups of processors, defining and using high-
> level processor connection topologies, and user-specified
> derived data types for message creation.
>
> The MPI workshop will be a combination of lectures and
> hands-on lab session in which the participants will write
> and execute sample MPI programs.
>
> Interested parties should contact Aline Davis at
> aline@osc.edu or (614) 292-9248. Due to the hands-on nature
> of the workshop, REGISTRATION IS LIMITED TO 20 STUDENTS.
```

## New High Performance C++ Compiler for the Cray T3D

Not everyone is happy with CRI's C products for the T3D, and there have been several efforts to supplement them:

• The ACC compiler - ARSC's T3D newsletter #46 (8/4/95)
• The Split C compiler -
Recently I received this product announcement:
```
> Kuck & Associates, Inc. (KAI) announces the availability of
> the Photon C++ compiler for the Cray T3D computer
> architecture. Photon C++ has optimizations that allow
> developers to use object-oriented design all the way into
> the kernels of the application, and still achieve the
> performance of C code. As an assist to developing
> applications for the Cray T3D, Photon C++ is also available
> on every major Unix workstation.
>
> Photon C++ provides near draft standard syntax and a near
> draft standard C++ class library. In addition, for those
> with legacy codes, Photon C++ has Cfront 3.0 and 2.1
> compatibility modes.
>
> Photon C++ optimizes several paradigms used in object-
> oriented programming. Photon C++ automatically optimizes
> lightweight objects (objects that are created, used, and
> destroyed frequently), data abstractions (allowing the
> programmer to leave them in object-oriented form), and
> control flow to the most efficient form (allowing
> structured control flow to be maintained). Photon C++
> eliminates redundant tests, allowing self-checking member
> functions to be used efficiently.
>
> Photon C++ supports name spaces, exceptions, templates
> (with automatic instantiation), global constructors, RTTI,
> and STL.
>
> Look at this web page for more information on Photon C++
> on the Cray T3D:
>
>
>    http://www.kai.com/photon/photon_t3d.html
>
> If you are looking for a single compiler to use across all
> of your development and production systems, consider using
> Photon C++ on your Unix workstations. Evaluation copies of
> Photon C++ are available now for these workstations. Look
> at this web page for more information on Photon C++ for
> Unix Workstations:
>
>
>    http://www.kai.com/photon/photon_what_is.html
>
> You can contact KAI at:
>
>         Kuck & Associates, Inc.   e-mail: kai@kai.com
>         1906 Fox Drive            Voice:  +1-217-356-2288
>         Champaign, IL 61820       Fax:    +1-217-356-5199
>         USA
>
```

## List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
1. Data type sizes are not the same (Newsletter #5)
2. Uninitialized variables are different (Newsletter #6)
3. The effect of the -a static compiler switch (Newsletter #7)
4. There is no GETENV on the T3D (Newsletter #8)
5. Missing routine SMACH on T3D (Newsletter #9)
6. Different Arithmetics (Newsletter #9)
7. Different clock granularities for gettimeofday (Newsletter #11)
8. Restrictions on record length for direct I/O files (Newsletter #19)
9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
10. Missing Linpack and Eispack routines in libsci (Newsletter #25)
11. F90 manual for Y-MP, no manual for T3D (Newsletter #31)
12. RANF() and its manpage differ between machines (Newsletter #37)
13. CRAY2IEG is available only on the Y-MP (Newsletter #40)
14. Missing sort routines on the T3D (Newsletter #41)
15. Missing compiler allocation flags (Newsletter #52)
16. Missing compiler listing flags (Newsletter #53)
17. Missing MXMA routine on the T3D (Newsletter #75)

I encourage users to e-mail in differences that they have found, so we can all benefit from each other's experience.

Here is a shortened version of the above timing program:

```
      parameter( lda = 1001, nmax = 1000, ncases = 37, maxtrips = 100 )
      integer index( ncases )
      real a( lda, 1000 )
      real b( nmax ), c( nmax ), d( nmax )
      real t( ncases, 10 )
      data index / 0,1,2,3,4,5,6,7,8,9,10,16,20,30,32,40,50,60,63,64,65,
     +             70,80,90,100,128,200,256,300,400,500,512,600,700,800,
     +             900,1000/
      do 100 j = 1, nmax
        do 90 i = 1, nmax
          a( i, j ) = j
   90   continue
        b( j ) = j
  100 continue
      d( 1 ) = 1.0
      do 110 i = 2, nmax
        d( i ) = d( i-1 ) + i * i
  110 continue
      do 130 j = 1, 10
        do 120 i = 1, ncases
          t( i, j ) = 0.0
  120   continue
  130 continue
      do 2000 ntrips = 1, maxtrips
        do 1000 kcase = 1, ncases
          n = index( kcase )
          tt = second()
          call sgemm( 'n','n',n,1,n,1.0,a,lda,b,nmax,0.0,c,nmax)
          t( kcase, 1 ) = t( kcase, 1 ) + second() - tt
          do 210 i = 1, n
            error = c( i ) - d( n )
            if( error .ne. 0.0 ) then
              print *, ' error with sgemm', kcase, n, i, c(i), d(n)
              stop
            endif
  210     continue
          tt = second()
          call sgemv( 'n', n, n, 1.0, a, lda, b, 1, 0.0, c, 1 )
          t( kcase, 2 ) = t( kcase, 2 ) + second() - tt
          if( n .gt. 0 ) then
            tt = second()
            call mxma( a, 1, lda, b, 1, nmax, c, 1, nmax, n, n, 1 )
            t( kcase, 3 ) = t( kcase, 3 ) + second() - tt
          else
            t( kcase, 3 ) = 1.0
          endif
          tt = second()
          call callsaxpy( a, lda, b, c, n )
          t( kcase, 4 ) = t( kcase, 4 ) + second() - tt
          tt = second()
          call callsdot( a, lda, b, c, n )
          t( kcase, 5 ) = t( kcase, 5 ) + second() - tt
          tt = second()
          call fsaxpy( a, lda, b, c, n )
          t( kcase, 6 ) = t( kcase, 6 ) + second() - tt
          tt = second()
          call fsaxpy1( a, lda, b, c, n )
          t( kcase, 7 ) = t( kcase, 7 ) + second() - tt
          tt = second()
          call fsaxpy2( a, lda, b, c, n )
          t( kcase, 8 ) = t( kcase, 8 ) + second() - tt
          tt = second()
          call fsaxpy3( a, lda, b, c, n )
          t( kcase, 9 ) = t( kcase, 9 ) + second() - tt
          tt = second()
          call fsdot( a, lda, b, c, n )
          t( kcase, 10 ) = t( kcase, 10 ) + second() - tt
 1000   continue
 2000 continue
      write( 6, 601 )
      write( 6, 602 )
      do 3000 i = 1, ncases
        ops = maxtrips * index(i) * ( index(i) + index(i)-1 ) / 1.0e6
        write( 6, 600 ) i, index(i), ( ops/t(i,j), j = 1, 10 )
 3000 continue
  600 format( i3, i5, 10f7.1 )
  601 format( 'case size sgemm  sgemv   mxma   call    call',
     &        ' fsaxpy fsaxpy1 fsaxpy2 fsaxpy3 fsdot' )
  602 format( '                                saxpy   sdot' )
      end

      subroutine callsaxpy( a, lda, b, c, n )
      real a( lda, 1 ), b( 1 ), c( 1 )
      do 10 i = 1, n
        c( i ) = 0.0
   10 continue
      do 20 i = 1, n
        call saxpy( n, b( i ), a( 1, i ), 1, c, 1 )
   20 continue
      end

      subroutine fsaxpy( a, lda, b, c, n )
      real a( lda, 1 ), b( 1 ), c( 1 )
      do 10 i = 1, n
        c( i ) = 0.0
   10 continue
      do 20 j = 1, n
        do 9 i = 1, n
          c( i ) = c( i ) + b( j ) * a( i, j )
    9   continue
   20 continue
      end

      subroutine fsaxpy1( a, lda, b, c, n )
      real a( lda, 1 ), b( 1 ), c( 1 )
      do 10 i = 1, n
        c( i ) = 0.0
   10 continue
      do 20 j = 1, n
        if( b( j ) .ne. 0.0 ) then
          do 11 i = 1, n
            c( i ) = c( i ) + b( j ) * a( i, j )
   11     continue
        endif
   20 continue
      end

      subroutine fsaxpy2( a, lda, b, c, n )
      real a( lda, 1 ), b( 1 ), c( 1 )
      do 10 i = 1, n
        c( i ) = 0.0
   10 continue
      do 20 j = 1, n
        if( b( j ) .ne. 0.0 ) then
          do 11 i = 1, n
            if( a( i, j ) .ne. 0.0 ) then
              c( i ) = c( i ) + b( j ) * a( i, j )
            endif
   11     continue
        endif
   20 continue
      end

      subroutine fsaxpy3( a, lda, b, c, n )
      real a( lda, 1 ), b( 1 ), c( 1 )
      do 10 i = 1, n
        c( i ) = 0.0
   10 continue
      k = 0
      do 20 j = 1, n-3, 4
        do 11 i = 1, n
          c( i ) = c( i ) +
     &      b(j)*a(i,j) + b(j+1)*a(i,j+1) +
     &      b(j+2)*a(i,j+2) + b(j+3)*a(i,j+3)
   11   continue
        k = k + 1
   20 continue
      do 30 j = 4*k+1, n
        do 21 i = 1, n
          c( i ) = c( i ) + b( j ) * a( i, j )
   21   continue
   30 continue
      end

      subroutine callsdot( a, lda, b, c, n )
      real a( lda, 1 ), b( 1 ), c( 1 )
      do 10 i = 1, n
        c( i ) = sdot( n, a( i, 1 ), lda, b, 1 )
   10 continue
      end

      subroutine fsdot( a, lda, b, c, n )
      real a( lda, 1 ), b( 1 ), c( 1 )
      do 10 i = 1, n
        c( i ) = 0.0
        do 9 j = 1, n
          c( i ) = c( i ) + a( i, j ) * b( j )
    9   continue
   10 continue
      end
```

Current Editors:

Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.