The Indiana University AVIDD Cluster and the CCA SC04 Tutorial
General notes
Indiana University has two Linux clusters, one each at Bloomington and
Indianapolis. They are named AVIDD-B and AVIDD-I, and each includes
roughly a hundred dual-processor nodes, mainly IBM x335s.
Each has a couple of head nodes (sometimes called "user nodes"), which are IBM x345s.
We'll use either the Bloomington or the Indianapolis cluster. This won't make
much difference, since both clusters are run by the same group and have the same
software configuration, and IU-Bloomington and IUPUI are connected
by fast Gbit/sec optical networks.
The systems information for the clusters is at
http://www.indiana.edu/~rats/research/avidd/index.shtml.
[Right now, that web page is password protected; I'll either get it opened
up or mirror the nonsensitive portions here.]
The head node to use is the second head node, bh2: 129.79.228.232. You can
use the rollover address avidd-b.iu.edu, but it may take you to the (currently
overloaded) first head node, bh1: 129.79.228.231.
Uptimes and maintenance
The Indianapolis cluster is going to be moved to another building during
13-24 September, which will likely put additional pressure on the
Bloomington cluster during that period. The Bloomington cluster is scheduled
for maintenance on:
- 20 September
- Some date TBP in October
- Some date TBP in November
Disk space and utilization
- The CCA tutorial staff have their home directories and shared space set up at
/N/I/cca . This is NFS mounted and visible from both the Bloomington
(avidd-b) and the Indianapolis (avidd-i) clusters.
- All users have a default 1GB quota. This is generally not enough for builds, etc.
A couple of people (Boyana and Ben?) have a 2GB quota.
- Try to use one of the scratch spaces for doing builds before the tutorial:
  - GPFS: only use this if you are insane. It should be eliminated before SC04 (actually, it
    never should have been put on the system - but that's another story). It is 1.6 TB in size.
  - NFS-exported scratch: This is only 25 GB in size, located at /N/sharedscr. It is
    visible from all of the nodes on both clusters, and is purged every 15 days or sooner.
  - Head node scratch space: /scr is about 28 GB and only visible from the given head node.
  - Compute node scratch space: /scr is about 13 GB and only visible from the given compute node.
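Since the default quota is tight, it can help to check how much a directory tree already holds before starting a build in it. The following is a minimal Python sketch, not a tool that exists on AVIDD. It assumes the 1GB quota means 1 GiB, and it sums apparent file sizes, which may differ from what the quota system actually counts. The names `directory_usage` and `fits_in_quota` are hypothetical.

```python
import os

# Assumption: the default "1GB" quota is read as 1 GiB.
QUOTA_BYTES = 1 * 1024**3


def directory_usage(root):
    """Return the total apparent size in bytes of regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_in_quota(root, extra_bytes=0, quota_bytes=QUOTA_BYTES):
    """True if current usage plus extra_bytes stays within the quota."""
    return directory_usage(root) + extra_bytes <= quota_bytes
```

For example, `fits_in_quota(os.path.expanduser("~"), extra_bytes=500 * 1024**2)` would estimate whether a build producing about 500 MB still fits in a home directory.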
The ACTS tutorial really stressed the I/O system during C++ builds. What we'll try to do is
set up a system that assigns each student a compute node, so that they can do their builds in that
compute node's local scratch directory (it won't help much if everyone uses the head node scratch
space, since there are only two head nodes).
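As a rough illustration of choosing a build location among the scratch spaces listed above, here is a small Python sketch. It prefers the node-local /scr and falls back to /N/sharedscr. The 2 GiB free-space threshold, the function names, and the `<scratch>/<user>/<project>` layout are all assumptions made for the example, not anything configured on the clusters.

```python
import os
import shutil

# Paths taken from the notes above; local /scr is tried first.
SCRATCH_CANDIDATES = ("/scr", "/N/sharedscr")


def pick_scratch(candidates=SCRATCH_CANDIDATES, min_free_bytes=2 * 1024**3):
    """Return the first writable candidate with enough free space, else None.

    The 2 GiB default threshold is an arbitrary, hypothetical value.
    """
    for path in candidates:
        # Skip paths that are missing or not writable by this user.
        if not os.path.isdir(path) or not os.access(path, os.W_OK):
            continue
        if shutil.disk_usage(path).free >= min_free_bytes:
            return path
    return None


def build_dir(scratch, user, project):
    """Hypothetical per-user layout: <scratch>/<user>/<project>."""
    return os.path.join(scratch, user, project)
```

Called on a compute node, `pick_scratch()` would return "/scr" if it is writable and has room. Keep in mind that /N/sharedscr is purged every 15 days or sooner, so nothing there should be treated as permanent.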
Software on AVIDD
Try to get any special requests in early.
- For parallel debugging, TotalView and Vampir are available. Vampir is limited
to 32 processors by license, however. TotalView is in /N/hpc/totalview/bin/totalview.
- Compilers: the cluster has Intel, PGI, and the usual GNU stuff.
- Different MPICH builds are on the cluster:
  - /usr/local/mpich-gcc/
  - /usr/local/mpich-intel/
  - /usr/local/mpich-pgi/
  All three of those use the Myrinet network. The problem with this is that
  we must use the PBS system to launch parallel jobs: if more than two jobs try to use
  Myrinet simultaneously, the nodes get wedged and nothing runs.
  I'll see if we can build Ethernet versions of those. They will be slower, but will
  avoid the nightmare situation of one mistake shutting down the system for everyone.
  Either I will build, or ask the UITS team to build, a version of MPICH using the
  Intel 8.0 Fortran compilers and the latest GCC compilers for C and C++. If any other
  build of MPICH is needed, let me know.
- The batch scheduler is PBS Pro with the Maui Scheduler. For the tutorial we'll want
  to use the scheduler to get "interactive nodes". But then we must use PBS to avoid
  hanging the Myrinet network - more than two jobs accessing it at a time hang
  the whole system. We may just have everyone use Ethernet for the MPI network instead.
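To make the "launch everything through PBS" point concrete, here is a hedged Python sketch that writes out the text of a batch script for one of the MPICH builds listed above. The `#PBS` directives, `$PBS_O_WORKDIR`, and `$PBS_NODEFILE` are standard PBS features, and `-np` and `-machinefile` are standard MPICH `mpirun` options, but whether the AVIDD Myrinet builds accept exactly this launcher invocation is an assumption. The function name, job name, and walltime default are hypothetical. The `ppn=2` default reflects the dual-processor nodes.

```python
# Install prefixes from the list above.
MPICH_PREFIXES = {
    "gcc": "/usr/local/mpich-gcc",
    "intel": "/usr/local/mpich-intel",
    "pgi": "/usr/local/mpich-pgi",
}


def pbs_script(program, nodes, compiler="gcc", ppn=2,
               walltime="00:30:00", name="cca-job"):
    """Return the text of a PBS batch script that runs program under mpirun.

    Assumption: each MPICH prefix has bin/mpirun and it accepts
    -np and -machinefile, as stock MPICH does.
    """
    prefix = MPICH_PREFIXES[compiler]
    nprocs = nodes * ppn  # one MPI process per processor
    lines = [
        "#!/bin/sh",
        f"#PBS -N {name}",
        f"#PBS -l nodes={nodes}:ppn={ppn}",
        f"#PBS -l walltime={walltime}",
        "cd $PBS_O_WORKDIR",
        f"{prefix}/bin/mpirun -np {nprocs} -machinefile $PBS_NODEFILE {program}",
    ]
    return "\n".join(lines) + "\n"
```

Something like `print(pbs_script("./a.out", 4, compiler="intel"))` would produce a script that could be handed to `qsub`. An interactive session would normally be requested with `qsub -I` instead; that case is not generated here.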
Please forward any special requests through Bramley. There is a one-in-a-million
chance that he can answer the question himself, but more likely he can get the question to
the right person(s).
Please cite the university ("work supported in part by facilities provided by
Indiana University through NSF grants CDA-9601632 and EIA-0202048") in any
publications that come out of this, and let me know about them.
Randall Bramley
- Initiated: Sat Sep 4 16:19:37 EST 2004