The PBS batch system can be used to manage node allocation in a cluster of hosts. For example, with a suitable job script it is possible to pass to the MPI launcher program (mpirun) the number and the list of nodes that PBS has allocated for the whole job, as requested by the user. PBS will not run further jobs on the busy nodes until the current job ends. Here is an example script that does this over a Myrinet network, using an implementation of MPICH over GM (a proprietary protocol developed by Myricom); in such a script you normally have to change only the number of nodes required, the working directory and the name of the MPI executable.
#!/bin/sh
#! example of job file to submit parallel MPI applications
#! lines starting with #PBS are options for the qsub command
#! Number of nodes (in this case I require 4 nodes with 2 CPUs each)
#! The total number of processes passed to mpirun will be nodes*ppn
#PBS -l nodes=4:ppn=2
#! Names of the output files for standard output and standard error;
#! if not specified, the defaults are <job-name>.o<job-number> and <job-name>.e<job-number>
#PBS -e test.err
#PBS -o test.log
#! Mail to the user when the job terminates or aborts
#PBS -m ae
#!change the working directory (default is home directory)
cd <working directory>
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
echo `cat $PBS_NODEFILE`
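#! Count the number of processors (one line per processor in $PBS_NODEFILE)
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS processors
#! Create a machine file for Myrinet: the first line is the process count,
#! followed by one "host port" line per process. The first process on a host
#! gets port 6 and any repeated host gets port 7, so this assumes ppn <= 2.
echo $NPROCS >$PBS_JOBID.nodefile
awk '{if ($0 in vett) print $0 " " 7; else print $0 " " 6 ; vett[$0]="x"}' $PBS_NODEFILE >>$PBS_JOBID.nodefile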
#! Run the parallel MPI executable (change the default a.out)
/usr/local/mpi-myri/bin/mpirun.ch_gm --gm-v --gm-f $PBS_JOBID.nodefile --gm-kill 30 -np $NPROCS a.out
rm $PBS_JOBID.nodefile
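To see what the machine file step produces, the following sketch runs the same awk logic on a made-up node list. The helper name make_machinefile and the hostnames node01 and node02 are assumptions for illustration only; they are not part of PBS or MPICH-GM.

```shell
#!/bin/sh
# Sketch only: build a machine file from a PBS-style node list.
# make_machinefile is a hypothetical helper name.
make_machinefile() {
    nodefile=$1
    # First line: total number of processes (tr strips the padding
    # that some wc implementations put in front of the count)
    wc -l < "$nodefile" | tr -d ' '
    # Then one "host port" line per process: 6 for the first
    # occurrence of a host, 7 for any repeated occurrence
    awk '{ if ($0 in seen) print $0 " " 7; else print $0 " " 6; seen[$0] = "x" }' "$nodefile"
}

# Made-up node list standing in for $PBS_NODEFILE (as for nodes=2:ppn=2)
demo_nodes=${TMPDIR:-/tmp}/demo_nodes.$$
printf '%s\n' node01 node01 node02 node02 > "$demo_nodes"
make_machinefile "$demo_nodes"
rm -f "$demo_nodes"
```

For this input the output is the count 4 followed by the lines "node01 6", "node01 7", "node02 6" and "node02 7". A host listed three or more times would receive port 7 more than once, which is why the approach only fits allocations with at most two processes per node.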
A better solution is to replace the standard MPI launcher (mpirun), which uses the rsh mechanism to start the application on the nodes, with a new launcher program that uses the task manager library of PBS to spawn copies of the executable on all the nodes. The goals of such a program are:
One implementation of this scheme for Myrinet networks is the program mpiexec, which integrates PBS with the MPICH implementation over GM. In this case the example script can be simplified:
#!/bin/sh
#! example of job file to submit with qsub
#! lines starting with #PBS are options for the qsub command
#! Number of processors (8 in this case: 4 nodes with 2 CPUs each)
#PBS -l nodes=4:ppn=2
#! Names of the output files for standard output and standard error;
#! if not specified, the defaults are <job-name>.o<job-number> and <job-name>.e<job-number>
#PBS -e test.err
#PBS -o test.log
#! Mail to the user when the job terminates or aborts
#PBS -m ae
#! This job's working directory
echo Working directory is $PBS_O_WORKDIR
#!cd <working directory>
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
echo `cat $PBS_NODEFILE`
#! option to kill all the processes if one of them dies
export GMPIRUN_KILL=1 # or in csh: setenv GMPIRUN_KILL 1
export GMPIRUN_VERBOSE=1 # or in csh: setenv GMPIRUN_VERBOSE 1
#! Run the parallel MPI executable - it's possible to redirect stdin/stdout of all processes
#! using "<" and ">" - including the double quotes
/usr/local/bin/mpiexec -bg a.out
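Because the launcher now takes the allocation directly from PBS, it can be useful to print a per-host summary of $PBS_NODEFILE before starting the run, to check that it matches the nodes and ppn request. This is a minimal sketch; summarize_nodes is a hypothetical helper name, and it assumes the node file lists one hostname per allocated processor, as in the scripts above.

```shell
#!/bin/sh
# Sketch only: count how many processes PBS allocated on each host.
# summarize_nodes is a hypothetical helper name.
summarize_nodes() {
    # sort groups identical hostnames, uniq -c counts each group,
    # awk swaps the columns into "host: count" form
    sort "$1" | uniq -c | awk '{ print $2 ": " $1 " process(es)" }'
}

# Inside a PBS job, summarize the real allocation; do nothing elsewhere
if [ -n "${PBS_NODEFILE:-}" ]; then
    summarize_nodes "$PBS_NODEFILE"
fi
```

With the request nodes=4:ppn=2, each of the four hosts should appear with a count of 2.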