This page is part of the EmpiricalAlgorithmics web.
This page gives a quick overview of computational facilities available to users in BETA and LCI, and explains how to use them with the SunGridEngine scheduling software.
An extensive overview of all the features of SGE can be found at the Sun website.
Note: As of Jan-05-2011, samos & ganglia no longer appear to be available, and the submit host for arrow is arrowhead.cs.ubc.ca.
Details about the machines, their configuration, and their names: Ganglia
Jobs running on the arrow cluster belong to one of four priority classes. Jobs are scheduled (selected to be run) pre-emptively by priority class, and then evenly among users within a priority class. (Note that scheduling among users is done on the basis of CPU usage, not on the basis of the number of jobs submitted. Thus a user who submits many fast jobs will be scheduled more often than a user at the same priority class who submits many slow jobs.) Because of the preemptive scheduling, users submitting to a lower priority class may see high latency (it may take days or weeks before a queued job is scheduled). On the other hand, these lower priority jobs will be allocated all 100 CPUs when no higher-priority jobs are waiting. All users should feel free to submit as many jobs as they like (though prefer a few big array jobs over many single jobs), as doing so will not interfere with the cluster's ability to serve its primary purpose.
Note: much of the information on priority classes is out of date. This section will be updated soon. Priority classes are:
To use the arrow cluster, add

source /cs/sungridengine/arrow/common/settings.csh

to your configuration file (e.g. ~/csh_init/cshrc), or add

source /cs/beta/lib/pkg/sge-6.0u7_1/default/common/settings.csh

to access the old arrow cluster. For the beta cluster, simply type the following command in a shell:

source /cs/sungridengine/beta-icics/common/settings.sh
echo 'Hello world.'

If this is the content of the file helloworld.sh, you can submit a job by typing

qsub -cwd -o <outfiledir> -e <errorfiledir> helloworld.sh

on the command line. This will create a new job with an automatically assigned job number <jobnumber> that is queued and eventually run on a machine in the cluster. When the job runs, it will write output (stdout) to the file <outfiledir>/helloworld.sh.o<jobnumber>. It will also create a file <errorfiledir>/helloworld.sh.e<jobnumber> and write stderr to that file. In the above case, "Hello world." will be written to the outfile and the errorfile will be empty. If you don't want output files (for array jobs you can easily end up with thousands), specify /dev/null as <outfiledir> and <errorfiledir>.
qsub -cwd -o . -e . ~kevinlb/World/helloWorld.sh
qsub -cwd -o /dev/null -e /dev/null script.sh
qsub -cwd -o /dev/null -e /dev/null -q icics.q script.sh
To specify the shell your job runs under, either start your script with the two lines

#!/bin/sh
#$ -S /bin/sh

or give it only the first line #!/bin/sh, and call it like this:
qsub -cwd -S /bin/sh script.sh
Similarly, for a Perl script, start it with

#!/cs/local/bin/perl
#$ -S /cs/local/bin/perl

or you can add -S /cs/local/bin/perl to your qsub command and remove the second line from the script (the first line still needs to be there).
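Pulling the shell-script pieces together: qsub also reads options embedded in the script itself on lines starting with #$, so they need not be repeated on the command line each time. A minimal sketch (the -o/-e locations here are just example choices):

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
# Example output locations; any writable path (or /dev/null) works.
#$ -o .
#$ -e .
# The "#$" lines are ordinary comments to the shell but are read by
# qsub, so this script can be submitted with just: qsub helloworld.sh
echo 'Hello world.'
```

Running the script directly in a shell behaves exactly as if the #$ lines were absent, which is what makes this embedding safe.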
Array jobs are useful for submitting many similar jobs. In general, you should try to submit a single array job for each big run you're going to do, rather than (e.g.,) invoking qsub in a loop. This makes it easier to pause or delete your jobs, and also imposes less overhead on the scheduler. (In particular, the scheduler creates a directory for every job, so if you submit 10,000 jobs there will be 10,000 directories created before your job can start running. An array job containing 10,000 elements creates only a single directory.)
An example of an array job is the one-line script
echo 'Hello world, number ' $SGE_TASK_ID

If this is the content of the file many-helloworld.sh, you can submit an array job by typing

qsub -cwd -o <outfiledir> -e <errorfiledir> -t 1-100 many-helloworld.sh

on the command line, where the range 1-100 is chosen arbitrarily here. This will create a new array job with an automatically assigned job number <jobnumber> and 100 queued entries. Each entry of the array job will eventually run on a machine in the cluster; the <i>th entry will be called <jobnumber>.<i>. Sun Grid Engine treats every entry of an array job as a single job, and when the <i>th entry runs it assigns <i> to the variable $SGE_TASK_ID. You may use this variable to do arbitrarily complex things in your shell script; an easy option is to index a file and execute its <i>th line in the <i>th job.
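As a sketch of the file-indexing trick just mentioned (the file name commands.txt and its contents are made up for illustration), each array-job entry picks out and runs its own line of a command file:

```shell
#!/bin/sh
#$ -S /bin/sh
# Hypothetical command file: one shell command per line, so line i is
# executed by array-job entry i. Created inline here only to keep the
# sketch self-contained; on the cluster it would already exist.
printf "echo first\necho second\necho third\n" > commands.txt

# Under SGE this variable is set for you; we fake it for illustration.
SGE_TASK_ID=${SGE_TASK_ID:-2}

# Extract line number $SGE_TASK_ID from the file and run it.
CMD=$(sed -n "${SGE_TASK_ID}p" commands.txt)
eval "$CMD"
```

Submitted with -t 1-3, entry 1 would run "echo first", entry 2 "echo second", and so on.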
If you use the qsub syntax above on arrow, your job will be assigned to the default priority class associated with your user account. This can never be the priority class Urgent or the class kpmUrgent. To use a priority class other than your default for a job, use the following syntax (note the capital P):
qsub -cwd -o <outfiledir> -e <errorfiledir> -P <priorityclass> helloworld.sh

To submit in the Urgent class, please use:

qsub -cwd -o <outfiledir> -e <errorfiledir> -P Urgent -l urgent=1 helloworld.sh

To submit in the kpmUrgent class, please use:

qsub -cwd -o <outfiledir> -e <errorfiledir> -P kpmUrgent -l kpmUrgent=1 helloworld.sh

As mentioned above, the total number of urgent jobs is limited to 10. This is true even if those are the only jobs on the cluster; in that case (which probably will never happen, because many users will have jobs waiting) you can still fill the rest of the cluster with "normal" jobs.
If your job uses CPLEX or Matlab, add -l cplex=1 or -l matlab=1 to your qsub command. This will ensure that we don't run more jobs than there are licenses. Instructions on how to use CPLEX are found here. We're not exactly sure of the best way to run MATLAB without invoking X-windows, but the best we've been able to do is
matlab -nojvm -nodisplay -nosplash < inputfile.m

Another good approach is to use the Matlab compiler. The advantage of this is that compiled Matlab code does not require licenses to run.
Right now the cluster allows 22 simultaneous CPLEX processes and (separately) 20 simultaneous Matlab processes. Using the -l syntax above ensures that no more than this number of processes run. If you don't use the syntax, you run the risk of invoking more processes than there are licenses, causing some of your jobs to fail and thus requiring you to rerun jobs. The department as a whole has 100 Matlab licenses, of which many are in use at any time; thus, running a Matlab job without the -l syntax runs a high chance of failure. If there is a high level of Matlab usage in the department, it's also possible that you will run out of licenses even if you do use the -l flag, as this doesn't actually reserve the licenses for you. We have 22 CPLEX licenses, but they should only be used on this cluster; thus again you'll need to use -l but you shouldn't have to worry about users elsewhere in the dept. To check the number of Matlab licenses available you can type:
/cs/local/generic/lib/pkg/matlab-7.2/etc/lmstat -a

To change the number of available Matlab licenses when using -l matlab=1, an administrator has to change the cluster configuration; details on how to do this are on the administration page.
If you want to manually limit the number of processes that can run simultaneously to a different number, you can use -l max10=1, -l max20=1, -l max30=1 or -l max40=1. This will allow you to run 10, 20, 30 or 40 jobs at once. (Note that if multiple people running simultaneous jobs use the same flag, the number of simultaneous jobs will be limited across all users.)
There is also a consumable for memory-heavy jobs, memheavy. You use this just like the Matlab or CPLEX consumables, i.e. -l memheavy=1. Only one job using this consumable will be scheduled on each machine.
To claim a whole machine, there is a parallel environment called fillup. This will assign 2 (i.e., both) slots on the same machine to your job. You use it by specifying -pe fillup 2 in your submit command.
The -h option will hold your array of jobs in the queue until there are no other jobs running on the SGE. This isn't really recommended unless you want to be especially nice and make sure no one else is using the cluster when your jobs run.
The best option is -hold_jid job_identifier_list, which will hold your array of jobs in the queue until the specific job(s) named in job_identifier_list have finished. So, to manually limit the number of jobs you are running (let's say 10), split your array up into blocks of 10 and submit them one at a time with the -hold_jid jobID option, where jobID is the ID of the previous block of 10. This ensures that you never use more than 10 machines at a time, and it also lets you use any 10 machines.
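The block-of-10 scheme can be scripted. The sketch below is only illustrative (the script name job.sh and the 100-entry range are assumptions), and the id extraction relies on qsub's standard submission message of the form 'Your job-array 1234.1-10:1 ("job.sh") has been submitted':

```shell
#!/bin/sh
# Submit a 100-entry run as ten 10-entry array jobs, each held on the
# previous one, so at most 10 tasks are eligible to run at any time.
PREV=""
START=1
while [ "$START" -le 91 ]; do
  END=$((START + 9))
  if [ -z "$PREV" ]; then
    MSG=$(qsub -cwd -t ${START}-${END} job.sh)
  else
    MSG=$(qsub -cwd -t ${START}-${END} -hold_jid "$PREV" job.sh)
  fi
  # Field 3 of the message is e.g. "1234.1-10:1"; keep only the
  # numeric job id before the dot to hold the next block on.
  PREV=$(echo "$MSG" | awk '{print $3}' | cut -d. -f1)
  START=$((START + 10))
done
```

Each block only becomes eligible once the previous block's array job has finished completely.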
The Arrow cluster was funded under a CFI grant. This grant requires us to complete annual reports explaining how this infrastructure is being used. This information is summarized on the page ArrowClusterCFISummaries.
How to set up CVS.
For details on how to administer the cluster (requires admin access), look at SunGridEngineAdmin. For software installed on arrow (or software you need installed there), see ArrowSoftware.
-- FrankHutter and Lin Xu - 02 May 2006