Sun Grid Engine - quick user guide
Introduction
This page gives a quick overview of computational facilities available to users in BETA and LCI, and explains how to use them with the SunGridEngine scheduling software.
An extensive overview of all the features of SGE can be found on the Sun website.
Available clusters
- beta cluster: 5 machines, two 2GHz CPUs each, 4GB memory, running Linux. Available to members of the BETA lab.
This cluster should probably be used only via SGE, whose share-based scheduling actually works (as opposed to the current first-come-first-served scheme).
- icics cluster: 13 machines, two 3GHz CPUs each, 2GB memory, running Linux. Available to all members of the department.
Some people run jobs locally on these machines. SGE can still be used on top of that (it dispatches jobs based on load), but there is no guarantee of getting 100% CPU time on the nodes your jobs run on.
- arrow cluster: 50 machines, two 3.2GHz CPUs each, 2GB memory, running Linux.
This cluster is tied to Kevin Leyton-Brown's CFI grant for research on empirical hardness models, but is also available to other users in the department when it is idle.
Details about the machines, their configuration, and their names are available via Ganglia.
The Arrow cluster
Jobs running on the arrow cluster belong to one of five priority classes. Jobs are scheduled (selected to be run) preemptively by priority class, and then evenly among users within a priority class. (Note that scheduling among users is done on the basis of CPU usage, not on the basis of the number of jobs submitted. Thus a user who submits many fast jobs will be scheduled more often than a user at the same priority class who submits many slow jobs.) Because of the preemptive scheduling, users submitting to a lower priority class may see high latency (it may take days or weeks before a queued job is scheduled). On the other hand, these lower-priority jobs will be allocated all 100 CPUs when no higher-priority jobs are waiting. All users should feel free to submit as many jobs as they like (preferably as a few big array jobs rather than many single jobs), as doing so will not interfere with the cluster's ability to serve its primary purpose.
Priority classes are:
- Urgent: intended for very occasional use, mostly around paper deadlines. The cluster is limited to 10 such jobs at any time.
- eh (for empirical hardness): jobs which pertain to the particular project mentioned in the CFI grant under which the cluster was funded.
- ea (for empirical algorithmics): studies on the empirical properties of algorithms (i.e., "the 'E' in BETA"), but not part of the project described above.
- general: jobs which do not fall into one of the above categories. We ask that these jobs be relatively short in duration (although there can be arbitrarily many of them): since new jobs can only be scheduled when a processor becomes idle, excessively long low-priority jobs can lead to starvation of high-priority jobs.
- low: jobs of particularly low priority. This class can be used to submit a huge number of jobs that will give way to any other jobs in the queue.
In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Frank Hutter, Lin Xu, or Kevin Leyton-Brown.
How do I submit jobs?
- For the arrow cluster, add the line
source /cs/beta/lib/pkg/sge-6.0u7_1/default/common/settings.csh
to your configuration file (e.g., ~/csh_init/.cshrc). For the beta cluster, the appropriate line to add is
source /cs/beta/lib/pkg/sge/beta_grid/common/settings.csh
but we may eventually merge these into a single configuration. Currently you work solely with the one cluster indicated by this line in your configuration file; there is no easy way to switch back and forth (but this will hopefully change).
- ssh onto a submit host. For beta, the submit hosts are (we're not sure); for arrow, it is samos (which is now another name for arrow).
- A job is submitted to the cluster in the form of a shell (.sh) script.
- You can submit either single or array jobs. An example of a single job would be the one-line script
echo 'Hello world.'
If this is the content of the file helloworld.sh, you can submit a job by typing
qsub -cwd -o <outfiledir> -e <errorfiledir> helloworld.sh
on the command line. This will create a new job with an automatically assigned job number <jobnumber> that is queued and eventually run on a machine in the cluster. When the job runs, it will write its output (stdout) to the file <outfiledir>/helloworld.sh.o<jobnumber>. It will also create the file <errorfiledir>/helloworld.sh.e<jobnumber> and write stderr to it. In the above case, "Hello world." will be written to the output file and the error file will be empty. If you don't want output files (with array jobs you can easily end up with thousands), specify /dev/null as <outfiledir>, and similarly for error files.
If you just want to see if the cluster works for you, use the following command. When the job finishes, an output file will be written to the current directory.
qsub -cwd -o . -e . ~kevinlb/World/helloWorld.sh
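The whole single-job workflow can be sketched end to end as follows. The heredoc simply recreates the helloworld.sh script from the text; running it with sh lets you test it locally before submitting, and the qsub line (commented out, since it only works on a submit host) is the same one shown above.

```shell
# Create helloworld.sh as described in the text.
cat > helloworld.sh <<'EOF'
echo 'Hello world.'
EOF

# Test the script locally first; prints: Hello world.
sh helloworld.sh

# On a submit host you would then run:
#   qsub -cwd -o . -e . helloworld.sh
```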
What is an appropriate job size?
- Short Jobs Please!: To the extent possible, submit short jobs. We have not configured SGE as load balancing software, because allowing jobs to be paused or migrated can affect the accuracy of process timing. Once a job is running it runs with 100% of a CPU until it's done. Thus, if you submit many long jobs, you will block the cluster for other users. Due to the share-based scheduling we've set up, your overall share of computational time will not be larger if you submit larger jobs (e.g., shell scripts that invoke more runs of your program). However, while longer jobs will not increase your throughput, they will increase latency for other users.
- Not ridiculously short jobs though!: jobs that are extremely short (e.g., 0.1 seconds) might take longer to dispatch than they take to run, again hurting performance if a gigantic number of such jobs are in the queue. An ideal job length is on the order of a minute (i.e., if your jobs are extremely short, it's best to group them to achieve this sort of length). This way, higher-priority jobs can still access the cluster within about a second on average, but job dispatching does not overwhelm the submit host.
- Like it or not, you can't submit ridiculously long jobs: on the arrow cluster, jobs that run longer than 25 hours are automatically killed; even so, try to keep jobs to minutes or hours. This makes it easiest for SGE to share resources fairly. See the instructions on array jobs for how to split up a really long job.
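One common way to hit the minute-scale sweet spot is a chunking wrapper: each array-job task performs a block of consecutive short runs instead of one run per job. The sketch below is hypothetical (the chunk size of 100 and the echo placeholder are assumptions); submitted with, say, qsub -t 1-10, it would cover runs 1 through 1000.

```shell
#!/bin/sh
# Hypothetical chunking wrapper: task $SGE_TASK_ID handles runs
# (SGE_TASK_ID-1)*CHUNK+1 through SGE_TASK_ID*CHUNK.
: "${SGE_TASK_ID:=1}"   # set by SGE for each task; default for local testing
CHUNK=100

START=$(( (SGE_TASK_ID - 1) * CHUNK + 1 ))
END=$(( SGE_TASK_ID * CHUNK ))

i=$START
while [ "$i" -le "$END" ]; do
    echo "run $i"       # placeholder for the real short computation
    i=$(( i + 1 ))
done
```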
Array jobs
Array jobs are useful for submitting many similar jobs. In general, you should try to submit a single array job for each big run you're going to do, rather than (e.g.,) invoking qsub in a loop. This makes it easier to pause or delete your jobs, and also imposes less overhead on the scheduler. (In particular, the scheduler creates a directory for every job, so if you submit 10,000 jobs there will be 10,000 directories created before your job can start running. An array job containing 10,000 elements creates only a single directory.)
An example of an array job is the one-line script
echo 'Hello world, number ' $SGE_TASK_ID
If this is the content of the file many-helloworld.sh, you can submit an array job by typing
qsub -cwd -o <outfiledir> -e <errorfiledir> -t 1-100 many-helloworld.sh
on the command line, where the range 1-100 is chosen arbitrarily here. This will create a new array job with an automatically assigned job number <jobnumber> and 100 entries. Each entry of the array job will eventually run on a machine in the cluster; the <i>th entry is called <jobnumber>.<i>. Sun Grid Engine treats every entry of an array job as a single job and, when the <i>th entry runs, assigns <i> to the variable $SGE_TASK_ID. You may use this variable to do arbitrarily complex things in your shell script; an easy option is to index a file and execute its <i>th line in the <i>th task.
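The file-indexing trick just mentioned can be sketched as follows. The file name commands.txt and its contents are assumptions, and the demo file is created inline so the script can be tried outside the cluster; in a real array job you would prepare the command file beforehand, one shell command per line.

```shell
#!/bin/sh
# Sketch: execute the line of commands.txt selected by $SGE_TASK_ID,
# as the <i>th task of an array job would.

# Demo command file (hypothetical); normally prepared before submission.
printf "echo 'task one'\necho 'task two'\necho 'task three'\n" > commands.txt

: "${SGE_TASK_ID:=2}"   # set by SGE for each array-job task

# Pick out line number $SGE_TASK_ID and run it; here prints: task two
CMD=$(sed -n "${SGE_TASK_ID}p" commands.txt)
eval "$CMD"
```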
How to monitor, control, and delete jobs
- The command qstat checks the status of the queue. It lists all running and pending jobs. There is an entry for each running entry of an array job, whereas the pending parts of an array job are listed on one line. qstat -f gives detailed information for each cluster node; qstat -ext gives more detailed information for each job. Try man qstat for more options.
- The command qmon can be used to get a graphical interface to monitor and control jobs. It's not great, though.
- The command qdel deletes (your own) jobs (syntax: qdel <jobnumber>). You can also delete single entries of array jobs (syntax: qdel <jobnumber>.<i>).
Priority Classes
If you use the qsub syntax above on arrow, your job will be assigned to the default priority class associated with your user account. This cannot be the priority class Urgent. To use a priority class other than your default for a job, use the following syntax (note the capital P):
qsub -cwd -o <outfiledir> -e <errorfiledir> -P <priorityclass> helloworld.sh
To submit in the urgent class, please use:
qsub -cwd -o <outfiledir> -e <errorfiledir> -P Urgent -l urgent=1 helloworld.sh
As mentioned above, the total number of urgent jobs is limited to 10, even if they are the only jobs on the cluster. In that case (which will probably never happen, because many users will have jobs waiting) you can still fill the rest of the cluster with "normal" jobs.
CPLEX and MATLAB license management
If your job uses CPLEX or MATLAB, add -l cplex=1 or -l matlab=1 to your qsub command. This will ensure that we don't run more jobs than there are licenses. Instructions on how to use CPLEX are found here. We're not exactly sure of the best way to run MATLAB without invoking X windows, but the best we've been able to do is
matlab -nojvm -nodisplay -nosplash < inputfile.m
Right now the cluster allows 22 simultaneous CPLEX processes and (separately) 20 simultaneous MATLAB processes. Using the -l syntax above ensures that no more than this number of processes run. If you don't use it, you risk invoking more processes than there are licenses, causing some of your jobs to fail and requiring you to rerun them. The department as a whole has 100 MATLAB licenses, of which many are in use at any time; thus, running a MATLAB job without the -l syntax has a high chance of failure. If there is a high level of MATLAB usage in the department, you may also run out of licenses even if you do use the -l flag, as it doesn't actually reserve licenses for you. We have 22 CPLEX licenses, but they should only be used on this cluster; thus again you'll need to use -l, but you shouldn't have to worry about users elsewhere in the department. To check the number of MATLAB licenses available, you can type:
/cs/local/generic/lib/pkg/matlab-7.2/etc/lmstat -a
To change the number of available Matlab licenses when using -l matlab=1, an administrator has to change the cluster configuration; details on how to do this are on the administration page.
If you want to manually limit the number of processes that run simultaneously to a different number, you can use -l max10=1, -l max20=1, -l max30=1 or -l max40=1. These allow you to run at most 10, 20, 30 or 40 jobs at once. (Note that if multiple people running simultaneous jobs use the same flag, the limit on simultaneous jobs applies across all of those users combined.)
Where do Arrow jobs run?
(A live Ganglia load view was embedded here; see the Ganglia link above for the current load on each node.)
Administration
For details on how to administer the cluster (requires admin access), look at
SunGridEngineAdmin.
--
FrankHutter and Lin Xu - 02 May 2006