Using the IU Computer Science Odin Cluster

This guide is for new users and tutorial participants. The Odin cluster was purchased with NSF Research Infrastructure funds, so people who use the machine should cite NSF EIA-0202048 in any papers or software artifacts that result.

Architecture

Odin nodes are Opterons with 64-bit addressing. This can cause trouble for codes that were built on the assumption that pointers and ints are the same length. When compile or link problems occur, check every library you use to make sure they are all 32-bit or all 64-bit versions. The Unix "file" command reports this.
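A minimal sketch of that check, assuming the objects and shared libraries are ELF files. The helper name elf_classes and the example paths are hypothetical, not part of the Odin setup; the function only reads the output of "file" on stdin and reports whether the inputs are all 32-bit, all 64-bit, or mixed.

```shell
#!/bin/sh
# Summarize ELF word sizes from `file` output read on stdin.
# Prints "mixed" if both 32-bit and 64-bit objects appear,
# otherwise the single class found, or "none" if no ELF lines were seen.
elf_classes() {
    awk '
        /ELF 32-bit/ { c32++ }
        /ELF 64-bit/ { c64++ }
        END {
            if (c32 && c64) print "mixed"
            else if (c64)   print "64-bit"
            else if (c32)   print "32-bit"
            else            print "none"
        }'
}

# Usage (paths are placeholders; -L makes file follow symlinks):
#   file -L ./solver.o /some/lib/libexample.so | elf_classes
```

Static .a archives are reported by "file" as ar archives without a word size, so to check one you would first extract a member with ar x and run "file" on the resulting .o file.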

Head nodes have 8 GB of memory; compute nodes have 4 GB each. Storage is primarily on a SAN with a 3.5 terabyte capacity. If you want to use parallel I/O, each process would have to write separate files to /tmp on its own machine. Otherwise I/O will be limited by the 10GB/sec Infiniband switch.

Naming conventions

Odin has 128 nodes, with 2 processors each. In addition there are three head nodes: The naming convention for the compute nodes is

Compilation

Linking can only be done on a head node because some needed libraries are missing on the compute nodes. You can still compile to create object (.o) files, but the final linking must be on a head node.

Batch management

SLURM is running, but it is not strictly enforced. Instead it is used mainly as a mechanism to informally reserve nodes so that people don't all end up on the first compute node. Each node has an externally visible IP address (to support some distributed computing research here at IU), so you could reach a compute node directly. However, first log in to one of the head nodes and use SLURM to get a node allocation as follows:

  1. Allocate the processors and start the shell with srun -N 2 -A, which gives you 2 nodes (-n 4 gives 4 processors). This will start a new shell for you on the head node, and exiting that shell automatically releases your reservation of nodes. The default allocation time is four days.

  2. Find out which nodes you have by squeue | grep trainXXX where trainXXX is your assigned login name.

  3. Compile your jobs as needed on the head node. Before you launch any MPI jobs, be sure to ssh over to one of your assigned nodes from step 2. Most MPI implementations automatically start a process on the invoking node, which would mean everyone running a compute process on the head node.

MPI

LAM MPI is the default (/usr/bin) version, but for the CCA tutorial you will need to use the one built in /san/cca/mpich_gcc_intel_PIC. [I should set this up in the default path for each trainee account beforehand.]

MPI Case 1: LAM/MPI

If you are going to use LAM/MPI, the procedure is to use the mpdboot mechanism. The steps are:

  1. ssh to the first processor allocated to you by srun; otherwise mpd will start jobs on the head node, which you don't want.

  2. Create the .mpd.conf file with the single line secretword=whatever and then run chmod 600 .mpd.conf. Don't use one of your real passwords as the secret word; this is not cryptographically secure.

  3. Create the mpd.hosts file in your home directory, and make sure it lists the nodes you've been allocated by srun.

  4. mpdboot -n 2 starts 2 node daemons (which is 4 processors).

  5. mpdtrace -l shows which ones were grabbed; it's a good idea to check that the daemons were successfully started.

  6. Launch your job via /san/mpich2/bin/mpdrun -l -np 4 /san/cca/trainXXX/a.out &

  7. When finished, run mpdallexit to kill off the daemons.
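The mpd steps above can be sketched as a small script. This is only an illustration under stated assumptions: the function names write_mpd_files and run_mpd_job are hypothetical, the node names are examples, and the host file is assumed to hold one host name per line. Unlike step 6, the job is run in the foreground here so that mpdallexit is only reached after it finishes.

```shell
#!/bin/sh
# Write .mpd.conf and mpd.hosts into directory $1 (normally $HOME).
# The remaining arguments are the node names reported by squeue.
write_mpd_files() {
    dir=$1; shift
    # Throwaway secret word: this mechanism is not cryptographically secure.
    printf 'secretword=%s\n' "whatever" > "$dir/.mpd.conf"
    chmod 600 "$dir/.mpd.conf"
    # Assumed format: one host name per line.
    printf '%s\n' "$@" > "$dir/mpd.hosts"
}

# Boot the daemons, run the job, and shut the daemons down again.
# Call this on the first allocated compute node, not on the head node.
run_mpd_job() {
    mpdboot -n 2 || return 1     # 2 node daemons, which is 4 processors
    mpdtrace -l                  # confirm the daemons actually started
    /san/mpich2/bin/mpdrun -l -np 4 /san/cca/trainXXX/a.out
    mpdallexit                   # kill off the daemons when finished
}

# Usage, after ssh to your first node (node names are examples):
#   write_mpd_files "$HOME" odin120 odin121
#   run_mpd_job
```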

MPI Case 2: MPICH

CCA requires versions built with -fPIC and dynamic libraries. LAM does not support Fortran 90 modules, which occur in the CCA tutorial code. The MPI version to use is in /san/cca/mpich_gcc_intelPIC, and it was compiled with GCC 3.4.3 for C and C++, Intel ver. 9.0 Fortran90. ROMIO and the MPE libraries were also built in that directory.

  1. Create a machine_list file with the nodes you were assigned listed:

    odin120 odin121 ...

  2. Then use the usual mpirun script, /san/cca/mpich_intel_PIC/bin/mpirun -np 4 -machine_list machines a.out, where a.out is the MPI executable. Specifying the full path helps prevent the common mistake of compiling with one MPI and using the launch mechanism of another.
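One way to catch the mixed-MPI mistake before launching is to compare where the compiler wrapper and the launcher on your PATH come from. The sketch below is an assumption-laden illustration: the helper name same_mpi_install is hypothetical, and it assumes that both tools of one MPI build live in the same bin directory.

```shell
#!/bin/sh
# Succeed only if both paths are non-empty and share the same directory.
same_mpi_install() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$(dirname "$1")" = "$(dirname "$2")" ]
}

# Usage: warn if the mpicc and mpirun found on PATH do not match.
#   same_mpi_install "$(command -v mpicc)" "$(command -v mpirun)" ||
#       echo "warning: mpicc and mpirun come from different MPI installs" >&2
```

This does not replace giving the full path to mpirun; it only flags a PATH that mixes two installations.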

Other Software

The overall OS is Gentoo Linux, which may be partly responsible for ...

Libraries are sometimes in unexpected places on odin. We have

GCC 4.x has not yet been successfully built on odin, possibly because of some nonstandard placement of the libraries. If possible, Bramley will try to build it and create an mpich_gfortran that uses all GCC compilers.

Cleaning up

To release your nodes: a) if you are running in interactive mode, the allocation is released automatically when you exit the shell, or b) use scancel to cancel the reservation.

We intend to keep your accounts and files intact for at least two weeks after the end of the tutorial, to give you time to transfer them over to your home machine for future delectation. After that, I will probably delete all .o, .so, and .a files and then tar the whole thing over to our HPSS tape system, just in case someone needs the files later. Be sure not to leave anything private lying around in the directories, like your credit card numbers or the bank account that you gave the nice Nigerian gentleman a few weeks ago.

Last modified: Wed Aug 24 10:36:36 EST 2005