ARSC T3D Users' Newsletter 20, January 30, 1995
ARSC T3D Upgrades
The next month or two will be a busy time for the ARSC T3D. We will be upgrading the following:
- 2MW to 8MW per PE, tentatively set for February 7th and 8th
- The T3D OS, MAX 1.1 to MAX 1.2, set for January 31st
- The T3D Programming Environment (libraries, tools, and compilers), P.E. 1.1 to P.E. 1.2, sometime in the next two months
Next T3D Class at ARSC
Introduction to Programming the CRAY T3D
Dates: February 8 - 10, 1995
Time: 9:00 AM - noon, 1:00 - 5:00 PM
Location: University of Alaska Fairbanks main campus, room TBA
Instructor: Mike Ess, Parallel Applications Consultant

Course Description: To satisfy increasing computational demands, computers of the future must have multiple processors working on the same program. The Cray T3D is a step in this direction. The Cray T3D, an MPP or Massively Parallel Processor, consists of 128 processors attached to the Cray Y-MP.
This class will cover the characterization and history of MPPs. With this background, students will see how the T3D approaches the problem of executing a program in parallel. The class will cover the three programming paradigms for extracting parallelism:
- Data-sharing, as with Fortran 90
- Work-sharing, as with CRAFT Fortran
- Message-passing, as implemented with PVM or shmem
The class will also cover:
- Performance measurement and tools
- Debugging techniques and tools
Application Procedure
There is no charge for attendance, but enrollment will be limited to 15. In the event of greater demand, applicants will be selected by ARSC staff based on qualifications, need, and the order and completeness of applications. The class may be cancelled if there are fewer than 5 applicants.
Send e-mail to email@example.com with the following information:
- course name
- your name
- UA status (e.g., undergrad, grad, Asst. Prof.)
- advisor (if you are a student)
- denali userid
- preferred e-mail address
- describe programming experience
- describe need for this class
I/O on the T3D and Y-MP
In investigating the prospect of implementing Phase II I/O on the T3D, we at ARSC have begun measuring I/O speeds on both the Y-MP and the T3D. Measuring I/O is complicated because:
It is always implemented with shared resources:
- shared physical disks
- shared I/O devices
- shared system buffers
It depends on the operating system to service user requests, and the availability of the OS depends on system load.
Its environment is not uniform across Y-MP systems:
- what physical devices?
- SSD or BMR or LDcache in memory?
- T3D or Y-MP?
A rich set of user options is available:
- formatted or unformatted?
- sequential or direct?
- record size large or small?
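To make the first of those options concrete, here is a small sketch, in Python rather than the Cray Fortran used in the timings below, of the difference between unformatted and formatted output: unformatted I/O copies raw words to disk, while formatted I/O performs a text conversion on every element. The file names are arbitrary.

```python
import os
import struct

values = [i * 0.5 for i in range(1024)]   # 1024 "words" of data

# Unformatted: one raw binary transfer, 8 bytes per 64-bit word.
with open("unformatted.dat", "wb") as f:
    f.write(struct.pack("%dd" % len(values), *values))

# Formatted: a text conversion per element -- more CPU work per word,
# and a larger file.
with open("formatted.dat", "w") as f:
    for v in values:
        f.write("%f\n" % v)

print(os.path.getsize("unformatted.dat"))  # 8192 bytes for 1024 words
```

The per-element conversion cost is one reason formatted I/O is generally the slowest of the options above.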
Table 1: Y-MP speeds (MW/sec) for unformatted I/O on two different file systems

array size    /u1 file system     /tmp file system
(in words)    reads    writes     reads    writes
----------   ------   ------     ------   ------
      1024   13.272   13.118     13.289   13.136
      2048   22.613   22.456     21.866   22.428
      4096   33.306   33.764     33.561   33.761
      8192   45.969   46.082     45.968   45.948
     16384   55.130   55.903     55.163   54.907
     32768    0.365   12.266     25.676   25.878
     65536    1.103    1.319     25.725   23.909
    131072    1.014    0.874     22.260   22.037
    262144    0.912    0.967     21.106   21.978

Of course, speed increases with the size of the transfer, but only while the size of the buffer is larger than the size of the transfers. The high speed on the /tmp file system is due (among other things) to LDcache, which is like a RAM disk used as a buffer and is carved out of the 1GW of memory on ARSC's Y-MP. (So higher I/O speed is another reason why users should work out of /tmp, rather than their home directories, which are not LDcached.)
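The measurement behind Table 1 can be sketched as follows. This is a Python analogy run on whatever local disk is handy, not the Fortran program used on the Y-MP: time an unformatted write and convert the elapsed time to words transferred per second. The file name and sizes are arbitrary.

```python
import os
import struct
import time

def write_speed(path, nwords):
    """Time one unformatted write of nwords 64-bit words; return words/sec."""
    data = struct.pack("%dd" % nwords, *range(nwords))
    t0 = time.perf_counter()
    with open(path, "wb") as f:
        f.write(data)
    elapsed = time.perf_counter() - t0
    return nwords / elapsed

# Speed generally rises with transfer size while the transfers still fit
# in the system's buffers, as in Table 1.
for n in (1024, 2048, 4096, 8192):
    print("%8d words: %.0f words/sec" % (n, write_speed("bench.dat", n)))
os.remove("bench.dat")
```

On a modern buffered file system the absolute numbers mean little; the point is the shape of the curve as the transfer size grows past the buffering.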
Next, we compare the Y-MP uniprocessor speeds to those of a single T3D PE running the same program:
Table 2: Y-MP and T3D speeds (MW/sec) for unformatted I/O on the /tmp file system

array size        Y-MP                T3D
(in words)    reads    writes     reads    writes
----------   ------   ------     ------   ------
      1024   13.289   13.136      3.806    5.144
      2048   21.866   22.428      6.046    6.171
      4096   33.561   33.761      7.220    7.301
      8192   45.968   45.948      7.974    8.025
     16384   55.163   54.907      8.400    8.426
     32768   25.676   25.878      3.181    3.292
     65536   25.725   23.909      3.309    3.208
    131072   22.260   22.037      2.833    2.740
    262144   21.106   21.978      2.853    2.819

The big difference between I/O on the Y-MP and the T3D is that I/O on the T3D is done by the mppexec agent, which is just another Y-MP job competing with all the other Y-MP jobs in the mix (ARSC's Y-MP always runs at more than 95% utilization). The degradation beyond 16K-word transfers on the T3D must be due to some buffer other than LDcache, because both sets of writes go to files on the /tmp file system.
In all the above timings I am running something like:
      parameter( ia1size = 1024 )
      real a1( ia1size )
      . . .
      call asnunit( iun, '-a /tmp/ess/fort.12', ier )
      open( iun, form = 'unformatted' )
      t1 = rtc()
      write( iun ) a1
      t1 = rtc() - t1
      time = t1 * 6.666e-9
      speed = ia1size / time

and in this case the compiler knows the length of the write from the declaration of the array a1. But if I rewrite the write statement in a more flexible way as:
      write( iun ) ( a1( i ), i = 1, ia1size )

then the performance on the T3D drops drastically, because the T3D compiler does not treat the implied DO loop in an I/O statement as a special case the way the Y-MP compiler does.
Table 3: Y-MP and T3D speeds (MW/sec) for unformatted I/O with the second write construct

array size         T3D                Y-MP
(in words)    reads    writes     reads    writes
----------   ------   ------     ------   ------
      1024    0.066    0.067     13.361   13.480
      2048    0.067    0.067     22.439   22.442
      4096    0.067    0.067     33.953   34.211
      8192    0.067    0.067     46.392   46.397
     16384    0.067    0.067     55.360   55.217
     32768    0.066    0.066     25.136   25.456
     65536    0.066    0.066     24.863   25.043
    131072    0.066    0.066     21.975   21.816
    262144    0.066    0.066     21.987   22.012

I'm sure this is only a temporary difference between the Y-MP and T3D compilers, and that in the future the T3D compiler will be as smart as the Y-MP compiler about such an implied DO loop in an I/O statement. I/O is a complicated business, and if you find some insight or technique, I'm sure we'd all like to hear about it.
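The effect in Table 3 is essentially whole-array transfer versus per-element transfer. Here is a sketch of that difference in Python, not the Fortran above: one call that moves the whole array, versus one library call per element, which is roughly what an unoptimized implied DO loop turns into. File names are arbitrary.

```python
import struct
import time

a = list(range(16384))

# One transfer of the whole array (like "write(iun) a1").
t0 = time.perf_counter()
with open("whole.dat", "wb") as f:
    f.write(struct.pack("%dd" % len(a), *a))
t_whole = time.perf_counter() - t0

# One call per word (like an unoptimized implied DO loop).
t0 = time.perf_counter()
with open("loop.dat", "wb") as f:
    for x in a:
        f.write(struct.pack("d", x))
t_loop = time.perf_counter() - t0

# Both files hold identical bytes; only the call overhead differs.
print("loop/whole time ratio: %.1f" % (t_loop / t_whole))
```

The per-call overhead, not the amount of data, is what dominates the slow case, which is why the T3D column in Table 3 is nearly flat across array sizes.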
List of Differences Between T3D and Y-MP
The current list of differences between the T3D and the Y-MP is:
- Data type sizes are not the same (Newsletter #5)
- Uninitialized variables are different (Newsletter #6)
- The effect of the -a static compiler switch (Newsletter #7)
- There is no GETENV on the T3D (Newsletter #8)
- Missing routine SMACH on T3D (Newsletter #9)
- Different Arithmetics (Newsletter #9)
- Different clock granularities for gettimeofday (Newsletter #11)
- Restrictions on record length for direct I/O files (Newsletter #19)
- Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
In newsletter #18 there is a list of CRI T3D optimization articles available from ARSC.
In Newsletter #19 there is a list of CUG articles on the T3D available from ARSC.
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020, Fairbanks AK 99775-6020
Subscribe to (or unsubscribe from) the e-mail edition of the
ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.