Difference: SunGridEngine (1 vs. 44)

Revision 44 (2015-03-11) - HolgerHoos

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 33 to 33
 

How do I submit jobs?

Changed:
<
<
  • For the (new) arrow cluster, add the line
>
>
  • For the (new) arrow cluster, add the following line to .bashrc (and make sure to use the bash shell):
 
   source /cs/sungridengine/arrow/common/settings.sh
   

Revision 43 (2014-05-08) - seramage

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 12 to 12
 

Available clusters

Added:
>
>
Note: As of May-07-2014, no one seems to use the beta or icics cluster, at least in the BETA lab. The instructions for running on these clusters have been removed; please see previous revisions of this document if you need them.
 
Note: As of Jan-05-2011, samos & ganglia no longer appear to be available, and the submit host for arrow is arrowhead.cs.ubc.ca
Changed:
<
<
  • beta cluster: 5 machines, two 2GHz CPUs each, 4GB memory, running Linux. Available for members of the beta lab.
    This should probably be used only via SGE (with a share-based scheduling system that will actually work, as opposed to the current first-come-first-serve scheme)
  • icics cluster: 13 machines, two 3GHz CPUs each, 2GB memory, running Linux. Available for all members of the department.
    Some people run stuff locally on these machines. We could still use SGE on top of that (it dispatches jobs based on load), but there is no guarantee to get 100% CPU time on the nodes you're running on.
  • arrow cluster: 50 machines, two 3.2 GHz CPUs each, 2GB memory, running Linux.
>
>
  • arrow cluster: 55 machines, two 3.2 GHz CPUs each, 2GB memory, running Linux.
  This cluster is tied to Kevin Leyton-Brown's CFI grant for research on empirical hardness models, but is also available to other users in the department when it is idle.

Details about the machines, their configuration, and their names: Ganglia

Line: 29 to 28
Jobs running on the arrow cluster belong to one of several priority classes. Jobs are scheduled (selected to be run) pre-emptively by priority class, and then evenly among users within a priority class. (Note that scheduling among users is done on the basis of CPU usage, not on the basis of the number of jobs submitted. Thus a user who submits many fast jobs will be scheduled more often than a user at the same priority class who submits many slow jobs.) Because of the preemptive scheduling, users submitting to a lower priority class may see high latency (it may take days or weeks before a queued job is scheduled). On the other hand, these lower-priority jobs will be allocated all 100 CPUs when no higher-priority jobs are waiting. All users should feel free to submit as many jobs as they like (but please use a few big array jobs rather than many single jobs), as doing so will not interfere with the cluster's ability to serve its primary purpose.
Changed:
<
<
Note: much of the information on priority classes is out of date. This section will be updated soon. Priority classes are:
  1. Urgent: intended for very occasional use, mostly around paper deadlines. The cluster is limited to 10 such jobs at any time. Please use the urgent consumable (see 'priority classes' below).
  2. eh (for empirical hardness): jobs which pertain to the particular project mentioned in the CFI grant under which the cluster was funded.
  3. ea (for empirical algorithmics): studies on the empirical properties of algorithms (i.e., "the 'E' in BETA"), but not part of the project described above.
  4. general: jobs which do not fall into one of the above categories. We ask that these jobs be relatively short in duration (although there can be arbitrarily many of them): since new jobs can only be scheduled when a processor becomes idle, excessively long low-priority jobs can lead to starvation of high-priority jobs.
  5. low: jobs of particularly low priority. This priority class can be used to submit a huge number of jobs that will give way to any other jobs in the queue if there are any.
  6. kpm: jobs of Kevin Murphy's students.
  7. kpmUrgent: urgent jobs of Kevin Murphy's students. If you use this priority class, please use the kpmUrgent consumable (i.e. add -l kpmUrgent=1 to your SGE command; see 'priority classes' below); if you don't you will block the cluster for everyone else.
  8. klb: jobs of Kevin Leyton-Brown's students.
In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Kevin Leyton-Brown. When he grants you access, he, or Steve Ramage can add you as a user.
>
>
To request access, please contact Kevin Leyton-Brown. When he grants you access, he, or Steve Ramage can add you as a user.
 

How do I submit jobs?

  • For the (new) arrow cluster, add the line

Changed:
<
<
source /cs/sungridengine/arrow/common/settings.csh
>
>
source /cs/sungridengine/arrow/common/settings.sh
 
Deleted:
<
<
to your configuration file (e.g. ~/csh_init/cshrc), or add
   source /cs/beta/lib/pkg/sge-6.0u7_1/default/common/settings.csh
   
to access the old arrow cluster. For the beta cluster, simply type the following command in a shell:
   source /cs/sungridengine/beta-icics/common/settings.sh
 
Changed:
<
<
  • ssh onto a submit host. For beta, submit hosts are (we're not sure); for arrow it is samos (which is now another name for arrow). Note that the tech staff has restricted ssh access to some hosts in the department (for security reasons), and samos and the arrowNN machines are affected by this. SSH access to arrowNN is restricted to connections coming from samos (note that you shouldn't actually be connecting to anything other than arrow01, and even that machine may have some restrictions), and SSH access to samos is restricted to connections from begbie and okanagan.
>
>

  • ssh onto a submit host (arrowhead.cs.ubc.ca). Note that the tech staff has restricted ssh access to some hosts in the department (for security reasons); arrowhead is only reachable from within the department. To SSH to an arrowNN node, you must first SSH to arrowhead.
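    For example (a sketch; arrow06 is just an illustrative node name):
       ssh arrowhead.cs.ubc.ca
       ssh arrow06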
 
  • A job is submitted to the cluster in the form of a shell (.sh) script.
  • You can submit either single jobs or array jobs. An example of a single job is the following short script:

Added:
>
>
#!/bin/bash
echo 'Hello world.'
sleep 30
echo 'Good bye.'
If this is the content of the file helloworld.sh, you can submit a job by typing:
   qsub -cwd -S /bin/bash -q all.q -P eh -o ./ ./helloworld.sh
 
Changed:
<
<
If this is the content of the file helloworld.sh, you can submit a job by typing
>
>
Your output should be similar to:
 
Changed:
<
<
qsub -cwd -o -e helloworld.sh
>
>
Your job 359475 ("helloworld.sh") has been submitted
 
Changed:
<
<
on the command line. This will create a new job with an automatically assigned job number <jobnumber> that is queued and eventually run on a machine in the cluster.
>
>
This will create a new job with an automatically assigned job number <jobnumber> that is queued and eventually run on a machine in the cluster.
When the job runs, it will write output (stdout) to the file <outfiledir>/helloworld.sh.o<jobnumber>. It will also create a file <errorfiledir>/helloworld.sh.e<jobnumber> and write stderr to that file. In the above case, "Hello world." will be written to the outfile and the errorfile will be empty. If you don't want output files (for array jobs you can easily end up with thousands) specify /dev/null as <>, and similarly for error files.
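For example, after the sample job from above finishes (job number as in the sample submission, which used -cwd and -o ./; the exact paths depend on your -o and -e settings), you would find files like:
   ./helloworld.sh.o359475   (stdout: Hello world. / Good bye.)
   ./helloworld.sh.e359475   (stderr: empty)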
Added:
>
>
You can verify that your job is running by typing
   qstat -u "*"
   
Note: The quotes around the asterisk are not an error. You should see:
qstat -u "*"
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 359476 0.56000 helloworld seramage     r     05/07/2014 17:59:30 all.q@arrow06.cs.ubc.ca            1        

If the state is r, then the job is running. If many jobs are queued ahead of yours, it may wait in state qw.
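To see full details for a particular job, including why it has not started yet, you can query it by job number (here using the ID from the example above):
   qstat -j 359476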

 
  • If you just want to see if the cluster works for you, use the following command. When the job finishes, an output file will be written to the current directory.
       qsub -cwd -o . -e . ~kevinlb/World/helloWorld.sh

Revision 42 (2014-02-16) - seramage

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 39 to 39
 
  1. kpm: jobs of Kevin Murphy's students.
  2. kpmUrgent: urgent jobs of Kevin Murphy's students. If you use this priority class, please use the kpmUrgent consumable (i.e. add -l kpmUrgent=1 to your SGE command; see 'priority classes' below); if you don't you will block the cluster for everyone else.
  3. klb: jobs of Kevin Leyton-Brown's students.
Changed:
<
<
In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Kevin Leyton-Brown. When he grants you access, he, or Steve Ramage can add you as a user.
>
>
In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Kevin Leyton-Brown. When he grants you access, he, or Steve Ramage can add you as a user.
 

How do I submit jobs?

Revision 41 (2014-02-14) - zongxumu

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 29 to 29
Jobs running on the arrow cluster belong to one of several priority classes. Jobs are scheduled (selected to be run) pre-emptively by priority class, and then evenly among users within a priority class. (Note that scheduling among users is done on the basis of CPU usage, not on the basis of the number of jobs submitted. Thus a user who submits many fast jobs will be scheduled more often than a user at the same priority class who submits many slow jobs.) Because of the preemptive scheduling, users submitting to a lower priority class may see high latency (it may take days or weeks before a queued job is scheduled). On the other hand, these lower-priority jobs will be allocated all 100 CPUs when no higher-priority jobs are waiting. All users should feel free to submit as many jobs as they like (but please use a few big array jobs rather than many single jobs), as doing so will not interfere with the cluster's ability to serve its primary purpose.
Added:
>
>
Note: much of the information on priority classes is out of date. This section will be updated soon.
 Priority classes are:
  1. Urgent: intended for very occasional use, mostly around paper deadlines. The cluster is limited to 10 such jobs at any time. Please use the urgent consumable (see 'priority classes' below).
  2. eh (for empirical hardness): jobs which pertain to the particular project mentioned in the CFI grant under which the cluster was funded.
Line: 38 to 39
 
  1. kpm: jobs of Kevin Murphy's students.
  2. kpmUrgent: urgent jobs of Kevin Murphy's students. If you use this priority class, please use the kpmUrgent consumable (i.e. add -l kpmUrgent=1 to your SGE command; see 'priority classes' below); if you don't you will block the cluster for everyone else.
  3. klb: jobs of Kevin Leyton-Brown's students.
Changed:
<
<
In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Kevin Leyton-Brown. When he grants you access, he, Frank Hutter or Lin Xu can add you as a user.
>
>
In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Kevin Leyton-Brown. When he grants you access, he, or Steve Ramage can add you as a user.
 

How do I submit jobs?

Line: 114 to 115
  Thus, if you submit many long jobs, you will block the cluster for other users. Due to the share-based scheduling we've set up, your overall share of computational time will not be larger if you submit larger jobs (e.g., shell scripts that invoke more runs of your program). However, while longer jobs will not increase your throughput, they will increase latency for other users.
  • Not ridiculously short jobs though!: jobs that are extremely short (e.g., 0.1 second) might take longer to be dispatched than they take to run, again hurting performance if a gigantic number of such jobs are in the queue. An ideal job length is on the order of a minute (i.e., if your jobs are extremely short, it's best to group them to achieve this sort of length; see the sketch after this list). This would mean that higher-priority jobs would be able to access the cluster within about a second on average, but that job dispatching would not overwhelm the submit host.
Changed:
<
<
  • Like it or not, you can't submit ridiculously long jobs: Please keep unrestricted jobs in the range of a few minutes or maximally around an hour. Once a job runs on the cluster it will run until completion (except jobs longer than 25 hours which are automatically killed); this means that if you submit unrestricted long jobs you will completely block the cluster for all other users. So please try to split jobs up into array jobs (see the instructions on array jobs) where each single task takes just a few minutes. If you have to submit longer jobs, please use a consumable, such as max20 (see below).
>
>
  • <!--*Like it or not, you can't submit ridiculously long jobs*: Please keep unrestricted jobs in the range of a few minutes or maximally around an hour. Once a job runs on the cluster it will run until completion (except jobs longer than 25 hours, which are automatically killed); this means that if you submit unrestricted long jobs you will completely block the cluster for all other users. So please try to split jobs up into array jobs (see the instructions on array jobs) where each single task takes just a few minutes. If you have to submit longer jobs, please use a consumable, such as max20 (see below).-->
    As of Feb 2014, the arrow cluster is not as heavily contended as it once was. Usage is normally coordinated by gentlemanly agreement based on group priorities, and arrow will no longer automatically kill your long jobs.
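If your individual runs are very short, here is a minimal sketch of grouping them into one task of reasonable length (myprogram and the input file names are placeholders):
   #!/bin/bash
   #$ -S /bin/bash
   # Run 50 very short runs back-to-back so the task lasts on the order of a minute.
   for i in $(seq 1 50); do
     ./myprogram "input_${i}.dat"
   done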
 

Array jobs

Revision 40 (2012-01-05) - seramage

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 12 to 12
 

Available clusters

Added:
>
>

Note: As of Jan-05-2011, samos & ganglia no longer appear to be available, and the submit host for arrow is arrowhead.cs.ubc.ca
 
  • beta cluster: 5 machines, two 2GHz CPUs each, 4GB memory, running Linux. Available for members of the beta lab.
    This should probably be used only via SGE (with a share-based scheduling system that will actually work, as opposed to the current first-come-first-serve scheme)
  • icics cluster: 13 machines, two 3GHz CPUs each, 2GB memory, running Linux. Available for all members of the department.

Revision 39 (2010-03-31) - ChrisNell

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 167 to 167
 If your job may require significant amounts of memory, please use the consumable memheavy. You use this just like the Matlab or CPLEX consumables, i.e. -l memheavy=1. Only one job using this consumable will be scheduled on each machine.
Added:
>
>

Multi-core jobs (Arrow Cluster)

If your job parallelizes such that it will use both CPUs on a single node and potentially impair the performance of other jobs assigned to the same host, consider using the parallel environment fillup. This will assign 2 (i.e., both) slots on the same machine to your job. You use this by specifying -pe fillup 2 in your submit command.
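For example (a sketch; the script name is a placeholder):
   qsub -cwd -pe fillup 2 mytwocorejob.sh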
 

Manually limiting the number of your jobs (Arrow Cluster)

By using one of the consumables max10, max20, max30, or max40, you can manually limit the number of jobs you run at once. This is useful if you have long jobs and are worried about blocking the whole cluster with your jobs. As usual for consumables, use e.g. -l max10=1. Note that you will compete with other users requesting the same consumable (only 10 max10 jobs can run, regardless of who submits them).
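For example, to let at most 10 of your jobs run at once (the script name is a placeholder):
   qsub -cwd -l max10=1 mylongjob.sh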

Revision 38 (2010-03-24) - DavidThompson

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Revision 37 (2010-03-04) - jastyles

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 49 to 49
  to access the old arrow cluster. For the beta cluster, simply type the following command in a shell:

Changed:
<
<
use sge
>
>
source /cs/sungridengine/beta-icics/common/settings.sh
 
  • ssh onto a submit host. For beta, submit hosts are (we're not sure); for arrow it is samos (which is now another name for arrow). Note that the tech staff has restricted ssh access to some hosts in the department (for security reasons), and samos and the arrowNN machines are affected by this. SSH access to arrowNN is restricted to connections coming from samos (note that you shouldn't actually be connecting to anything other than arrow01, and even that machine may have some restrictions), and SSH access to samos is restricted to connections from begbie and okanagan.
  • A job is submitted to the cluster in the form of a shell (.sh) script.

Revision 36 (2010-02-04) - DavidThompson

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 39 to 39
 

How do I submit jobs?

Changed:
<
<
  • For the arrow cluster, add the line
>
>
  • For the (new) arrow cluster, add the line
       source /cs/sungridengine/arrow/common/settings.csh
       
    to your configuration file (e.g. ~/csh_init/cshrc), or add
 
   source /cs/beta/lib/pkg/sge-6.0u7_1/default/common/settings.csh
   
Changed:
<
<
to your configuration file (e.g. ~/csh_init/.cshrc). For the beta cluster, simply type the following command in a shell:
>
>
to access the old arrow cluster. For the beta cluster, simply type the following command in a shell:
 
   use sge
   

Revision 35 (2009-07-23) - JamesWright

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 33 to 33
 
  1. low: jobs of particularly low priority. This priority class can be used to submit a huge number of jobs that will give way to any other jobs in the queue if there are any.
  2. kpm: jobs of Kevin Murphy's students.
  3. kpmUrgent: urgent jobs of Kevin Murphy's students. If you use this priority class, please use the kpmUrgent consumable (i.e. add -l kpmUrgent=1 to your SGE command; see 'priority classes' below); if you don't you will block the cluster for everyone else.
Added:
>
>
  1. klb: jobs of Kevin Leyton-Brown's students.
 In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Kevin Leyton-Brown. When he grants you access, he, Frank Hutter or Lin Xu can add you as a user.

Revision 34 (2009-04-16) - DerekBradley

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 162 to 162
 If your job may require significant amounts of memory, please use the consumable memheavy. You use this just like the Matlab or CPLEX consumables, i.e. -l memheavy=1. Only one job using this consumable will be scheduled on each machine.
Changed:
<
<

Manually limiting the number of your jobs

>
>

Manually limiting the number of your jobs (Arrow Cluster)

 By using one of the consumables max10, max20, max30, or max40, you can manually limit the number of jobs you run at once. This is useful if you have long jobs and are worried about blocking the whole cluster with your jobs. As usual for consumables, use e.g. -l max10=1. Note that you will compete with other users requesting the same consumable (only 10 max10 jobs can run, regardless of who submits them).
Changed:
<
<

NEW

Here is probably a better way to manually limit the number of jobs you are running. Use one of the following options when running qsub:
>
>

Manually limiting the number of your jobs (ICICS/BETA Clusters)

Here is another way to manually limit the number of jobs you are running, since the maxK consumables are not currently implemented for the ICICS & BETA clusters. Use one of the following options when running qsub:
 
  • [-h] place user hold on job
  • [-hold_jid job_identifier_list] define jobnet interdependencies
Changed:
<
<
The [-h] option will hold your array of jobs in the queue until there are NO other jobs running on the SGE. The [-hold_jid job_identifier_list] will hold your array of jobs in the queue until a SPECIFIC job(s) is finished, defined by the job_identifier_list. So, to manually limit the number of jobs you are running (let's say 10 jobs) but NOT limit yourself to the max10 consumable, split your array up into blocks of 10 and submit them one at a time with the [-hold_jid jobID] option, where jobID is the ID of the previous block of 10. This will ensure that you never use more than 10 machines at a time.
>
>
The [-h] option will hold your array of jobs in the queue until there are NO other jobs running on the SGE. This isn't really recommended unless you want to be REALLY nice and make sure NO-ONE is using the cluster when you do.
 
Changed:
<
<
Note by Frank: not sure who recommended this, but it certainly doesn't work for everyone; the queue simply doesn't empty very often, so jobs with holds on them tend to never run. My advice is to stay with the maxK consumables above.
>
>
The best option is to use the [-hold_jid job_identifier_list], which will hold your array of jobs in the queue until a SPECIFIC job(s) is finished, defined by the job_identifier_list. So, to manually limit the number of jobs you are running (let's say 10 jobs), split your array up into blocks of 10 and submit them one at a time with the [-hold_jid jobID] option, where jobID is the ID of the previous block of 10. This will ensure that you never use more than 10 machines at a time, and it will also let you use any 10 machines. A sketch of this pattern follows below.
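A minimal sketch of this block-by-block submission, assuming your qsub supports the -terse option (it prints only the job ID; for array jobs the output looks like <jobID>.<first>-<last>:<step>, hence the cut). myarray.sh is a placeholder script:
   #!/bin/bash
   # Submit 100 tasks of myarray.sh in blocks of 10;
   # each block waits for the previous block to finish.
   PREV=$(qsub -terse -cwd -t 1-10 myarray.sh | cut -d. -f1)
   for START in $(seq 11 10 91); do
     PREV=$(qsub -terse -cwd -t ${START}-$((START+9)) -hold_jid "${PREV}" myarray.sh | cut -d. -f1)
   done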
 

What is the configuration of the machines, and how busy are they?

See http://samos.cs.ubc.ca/ganglia/ for machine load. To get the configuration of a machine, click on the appropriate cluster, and then on the machine of interest.

Revision 33 (2009-03-13) - FrankHutter

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 165 to 165
 

Manually limiting the number of your jobs

By using one of the consumables max10, max20, max30, or max40, you can manually limit the number of jobs you run at once. This is useful if you have long jobs and are worried about blocking the whole cluster with your jobs. As usual for consumables, use e.g. -l max10=1. Note that you will compete with other users requesting the same consumable (only 10 max10 jobs can run, regardless of who submits them).
Changed:
<
<

NEW

>
>

NEW

 Here is probably a better way to manually limit the number of jobs you are running. Use one of the following options when running qsub:
  • [-h] place user hold on job
  • [-hold_jid job_identifier_list] define jobnet interdependencies

The [-h] option will hold your array of jobs in the queue until there are NO other jobs running on the SGE. The [-hold_jid job_identifier_list] will hold your array of jobs in the queue until a SPECIFIC job(s) is finished, defined by the job_identifier_list. So, to manually limit the number of jobs you are running (let's say 10 jobs) but NOT limit yourself to the max10 consumable, split your array up into blocks of 10 and submit them one at a time with the [-hold_jid jobID] option, where jobID is the ID of the previous block of 10. This will ensure that you never use more than 10 machines at a time.

Added:
>
>
Note by Frank: not sure who recommended this, but it certainly doesn't work for everyone; the queue simply doesn't empty very often, so jobs with holds on them tend to never run. My advice is to stay with the maxK consumables above.
 

What is the configuration of the machines, and how busy are they?

See http://samos.cs.ubc.ca/ganglia/ for machine load. To get the configuration of a machine, click on the appropriate cluster, and then on the machine of interest.

Revision 32 (2009-03-09) - FrankHutter

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 196 to 196
 

Administration

For details on how to administer the cluster (requires admin access), look at SunGridEngineAdmin.

Added:
>
>
For software installed on arrow (or software you need installed there), see ArrowSoftware.
  -- FrankHutter and Lin Xu - 02 May 2006

Revision 31 (2009-02-18) - FrankHutter

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 32 to 32
 
  1. general: jobs which do not fall into one of the above categories. We ask that these jobs be relatively short in duration (although there can be arbitrarily many of them): since new jobs can only be scheduled when a processor becomes idle, excessively long low-priority jobs can lead to starvation of high-priority jobs.
  2. low: jobs of particularly low priority. This priority class can be used to submit a huge number of jobs that will give way to any other jobs in the queue if there are any.
  3. kpm: jobs of Kevin Murphy's students.
Changed:
<
<
  1. kpmUrgent: urgent jobs of Kevin Murphy's students. Please use the kpmUrgent consumable (see 'priority classes' below).
>
>
  1. kpmUrgent: urgent jobs of Kevin Murphy's students. If you use this priority class, please use the kpmUrgent consumable (i.e. add -l kpmUrgent=1 to your SGE command; see 'priority classes' below); if you don't you will block the cluster for everyone else.
 In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Kevin Leyton-Brown. When he grants you access, he, Frank Hutter or Lin Xu can add you as a user.

Revision 30 (2009-01-23) - DerekBradley

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 170 to 170
 
  • [-h] place user hold on job
  • [-hold_jid job_identifier_list] define jobnet interdependencies
Changed:
<
<
The [-h] option will hold your array of jobs in the queue until you have NO other jobs running on the SGE. The [-hold_jid job_identifier_list] will hold your array of jobs in the queue until a SPECIFIC job(s) is finished, defined by the job_identifier_list. So, to manually limit the number of jobs you are running (let's say 10 jobs) but NOT limit yourself to the max10 consumable, split your array up into blocks of 10 and submit them with the [-h] option. This will ensure that you never use more than 10 machines at a time.
>
>
The [-h] option will hold your array of jobs in the queue until there are NO other jobs running on the SGE. The [-hold_jid job_identifier_list] will hold your array of jobs in the queue until a SPECIFIC job(s) is finished, defined by the job_identifier_list. So, to manually limit the number of jobs you are running (let's say 10 jobs) but NOT limit yourself to the max10 consumable, split your array up into blocks of 10 and submit them one at a time with the [-hold_jid jobID] option, where jobID is the ID of the previous block of 10. This will ensure that you never use more than 10 machines at a time.
 

What is the configuration of the machines, and how busy are they?

See http://samos.cs.ubc.ca/ganglia/ for machine load. To get the configuration of a machine, click on the appropriate cluster, and then on the machine of interest.

Revision 29 (2009-01-20) - DerekBradley

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 165 to 165
 

Manually limiting the number of your jobs

By using one of the consumables max10, max20, max30, or max40, you can manually limit the number of jobs you run at once. This is useful if you have long jobs and are worried about blocking the whole cluster with your jobs. As usual for consumables, use e.g. -l max10=1. Note that you will compete with other users requesting the same consumable (only 10 max10 jobs can run, regardless of who submits them).
Added:
>
>

NEW

Here is probably a better way to manually limit the number of jobs you are running. Use one of the following options when running qsub:
  • [-h] place user hold on job
  • [-hold_jid job_identifier_list] define jobnet interdependencies

The [-h] option will hold your array of jobs in the queue until you have NO other jobs running on the SGE. The [-hold_jid job_identifier_list] will hold your array of jobs in the queue until a SPECIFIC job(s) is finished, defined by the job_identifier_list. So, to manually limit the number of jobs you are running (let's say 10 jobs) but NOT limit yourself to the max10 consumable, split your array up into blocks of 10 and submit them with the [-h] option. This will ensure that you never use more than 10 machines at a time.

 

What is the configuration of the machines, and how busy are they?

See http://samos.cs.ubc.ca/ganglia/ for machine load. To get the configuration of a machine, click on the appropriate cluster, and then on the machine of interest.

Revision 28 (2009-01-15) - FrankHutter

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 46 to 46
 
   use sge
   
Changed:
<
<
  • ssh onto a submit host. For beta, submit hosts are (we're not sure), for arrow it is samos (which is now another name for arrow)
>
>
  • ssh onto a submit host. For beta, submit hosts are (we're not sure); for arrow it is samos (which is now another name for arrow). Note that the tech staff has restricted ssh access to some hosts in the department (for security reasons), and samos and the arrowNN machines are affected by this. SSH access to arrowNN is restricted to connections coming from samos (note that you shouldn't actually be connecting to anything other than arrow01, and even that machine may have some restrictions), and SSH access to samos is restricted to connections from begbie and okanagan.
 
  • A job is submitted to the cluster in the form of a shell (.sh) script.
  • You can either submit single or array jobs. An example for a single job would e.g. be the one-line script

Revision 27 (2008-06-05) - MirelaAndronescu

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 74 to 74
  qsub -cwd -o /dev/null -e /dev/null -q icics.q script.sh
Added:
>
>
  • If you want your script to include some fancier commands rather than just calling your program, you have to add the line #$ -S /bin/sh to your script, or you have to add -S /bin/sh as an argument to qsub. For example, if your script has a while loop, it won't work unless your script has the following two lines at the beginning:
       #!/bin/sh
       #$ -S /bin/sh
       
    or it has only the first line #!/bin/sh, and you call it like this:
       qsub -cwd -S /bin/sh script.sh
       

  • Similarly, you can submit a Perl script instead of a shell script (Yeah, I like that too!); you just have to start your script with
       #!/cs/local/bin/perl
       #$ -S /cs/local/bin/perl
       
    or you can add -S /cs/local/bin/perl to your qsub command and remove the second line in the script (the first one still needs to be there).

  • It would probably work with other scripting languages too!

I hate shell scripts! Can I use Perl instead? Or some other scripting language?

  • Luckily, YES! Read the previous section "How do I submit jobs".
 

What is an appropriate job size?

  • Short Jobs Please!: To the extent possible, submit short jobs.

Revision 26 (2008-03-26) - FrankHutter

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 81 to 81
  Thus, if you submit many long jobs, you will block the cluster for other users. Due to the share-based scheduling we've set up, your overall share of computational time will not be larger if you submit larger jobs (e.g., shell scripts that invoke more runs of your program). However, while longer jobs will not increase your throughput, they will increase latency for other users.
  • Not ridiculously short jobs though!: jobs that are extremely short (e.g., 0.1 second) might take longer to be dispatched than they take to run, again hurting performance if a gigantic number of such jobs are in the queue. An ideal job length is on the order of a minute (i.e., if your jobs are extremely short, it's best to group them to achieve this sort of length). This would mean that higher-priority jobs would be able to access the cluster within about a second on average, but that job dispatching would not overwhelm the submit host.
Changed:
<
<
  • Like it or not, you can't submit ridiculously long jobs: On the arrow cluster, jobs that run longer than 25 hours are automatically killed, but rather try to keep jobs to minutes or hours. This makes it easiest for SGE to help share resources in a fair way. See the instructions on array jobs for how to split up a really long job.
>
>
  • Like it or not, you can't submit ridiculously long jobs: Please keep unrestricted jobs in the range of a few minutes or maximally around an hour. Once a job runs on the cluster it will run until completion (except jobs longer than 25 hours which are automatically killed); this means that if you submit unrestricted long jobs you will completely block the cluster for all other users. So please try to split jobs up into array jobs (see the instructions on array jobs) where each single task takes just a few minutes. If you have to submit longer jobs, please use a consumable, such as max20 (see below).
 

Array jobs

Revision 25 (2008-01-21) - FrankHutter

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 33 to 33
 
  1. low: jobs of particularly low priority. This priority class can be used to submit a huge number of jobs that will give way to any other jobs in the queue if there are any.
  2. kpm: jobs of Kevin Murphy's students.
  3. kpmUrgent: urgent jobs of Kevin Murphy's students. Please use the kpmUrgent consumable (see 'priority classes' below).
Changed:
<
<
In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Frank Hutter, Lin Xu or Kevin Leyton-Brown.
>
>
In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Kevin Leyton-Brown. When he grants you access, he, Frank Hutter or Lin Xu can add you as a user.
 

How do I submit jobs?

Revision 24 (2007-10-02) - ErikZawadzki

Line: 1 to 1
 

Sun Grid Engine - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Line: 145 to 145
 

What is the configuration of the machines, and how busy are they?

See http://samos.cs.ubc.ca/ganglia/ for machine load. To get the configuration of a machine, click on the appropriate cluster, and then on the machine of interest.
Added:
>
>

Common Problems

  • Your job has the status "Eqw"
    • Try _dos2unix_ing the file that your queue commands are in; a sketch for diagnosing and recovering follows below
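A hedged recovery recipe (the job number and script name are placeholders; qstat -j shows the error reason, and qmod -cj clears the error state so the job can be rescheduled):
   qstat -j <jobnumber> | grep -i error
   dos2unix myscript.sh
   qmod -cj <jobnumber>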
 

Where do Arrow jobs run?

Right here:

Revision 22 (2007-06-26) - MirelaAndronescu

Line: 1 to 1
 

Sun Grid Engine - quick user guide

Line: 57 to 57
on the command line. This will create a new job with an automatically assigned job number <jobnumber> that is queued and eventually run on a machine in the cluster. When the job runs, it will write output (stdout) to the file <outfiledir>/helloworld.sh.o<jobnumber>. It will also create a file <errorfiledir>/helloworld.sh.e<jobnumber> and write stderr to that file. In the above case, "Hello world." will be written to the outfile and the errorfile will be empty. If you don't want output files (for array jobs you can easily end up with thousands) specify /dev/null as <>, and similarly for error files.
Changed:
<
<
If you just want to see if the cluster works for you, use the following command. When the job finishes, an output file will be written to the current directory.
>
>
  • If you just want to see if the cluster works for you, use the following command. When the job finishes, an output file will be written to the current directory.
 
   qsub -cwd -o . -e . ~kevinlb/World/helloWorld.sh
   
Added:
>
>
  • If you don't want to store the output and error files for your job (be cautious though!), then you can throw them away directly, i.e. send them to /dev/null:
       qsub -cwd -o /dev/null -e /dev/null script.sh
       

  • If you want to submit to a specific queue, for example to run on machines of equal power, specify -q queue_name. The name of icics is icics.q, and the name of beta is beta.q:
       qsub -cwd -o /dev/null -e /dev/null -q icics.q script.sh
       
 

What is an appropriate job size?

  • Short Jobs Please!: To the extent possible, submit short jobs.

Revision 21 (2007-06-19) - ChrisThachuk

Line: 1 to 1
 

Sun Grid Engine - quick user guide

Line: 40 to 40
 
   source /cs/beta/lib/pkg/sge-6.0u7_1/default/common/settings.csh
   
Changed:
<
<
to your configuration file (e.g. ~/csh_init/.cshrc). For the beta cluster, the appropriate line to add is
>
>
to your configuration file (e.g. ~/csh_init/.cshrc). For the beta cluster, simply type the following command in a shell:
 
Changed:
<
<
source /cs/beta/lib/pkg/sge/beta_grid/common/settings.csh
>
>
use sge
 
Deleted:
<
<
but we may completely get rid of this configuration and have it all in one. Currently, you work solely with the one cluster that is indicated by this line in the configuration file - there is no easy way to go back and forth (but this will hopefully change).
 
  • ssh onto a submit host. For beta, submit hosts are (we're not sure), for arrow it is samos (which is now another name for arrow)
  • A job is submitted to the cluster in the form of a shell (.sh) script.
  • You can either submit single or array jobs. An example for a single job would e.g. be the one-line script

Revision 19 (2007-05-05) - KevinLeytonBrown

Line: 1 to 1
 

Sun Grid Engine - quick user guide

Line: 142 to 142
 

Arrow users: CFI Blurbs

Changed:
<
<
The Arrow cluster was funded under a CFI grant. This grant requires us to complete annual reports explaining how this infrastructure is being used. If you use the cluster for a project, large or small, please enter a bullet item here that gives a short description of your project and the role the cluster played.
  • Sample project: description (students/faculty involved)
>
>
The Arrow cluster was funded under a CFI grant. This grant requires us to complete annual reports explaining how this infrastructure is being used. This information is summarized on the page ArrowClusterCFISummaries.
 

Administration

Revision 18 (2007-04-17) - KevinLeytonBrown

Line: 1 to 1
 

Sun Grid Engine - quick user guide

Line: 110 to 110
 

CPLEX and MATLAB license management

Changed:
<
<
If your job uses CPLEX or Matlab, add -l cplex=1 or -l matlab=1 to your qsub command. This will ensure that we don't run more jobs than there are licenses. Instructions on how to use CPLEX are found here. We're not exactly sure of the best way to run MATLAB without invoking X-windows, but the best we've been able to do is
>
>
If your job uses CPLEX or Matlab, add -l cplex=1 or -l matlab=1 to your qsub command. This will ensure that we don't run more jobs than there are licenses. Instructions on how to use CPLEX are found here. We're not exactly sure of the best way to run MATLAB without invoking X-windows, but the best we've been able to do is
 
   matlab -nojvm -nodisplay -nosplash < inputfile.m
   
Added:
>
>
Another good approach is to use the Matlab compiler. The advantage of this is that compiled Matlab code does not require licenses to run.
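A minimal sketch of the compiler route (myscript.m is a placeholder; mcc generates both a binary and a run_myscript.sh wrapper whose first argument is the MATLAB/MCR root directory - the path below is only an example):
   mcc -m myscript.m
   qsub -cwd ./run_myscript.sh /cs/local/generic/lib/pkg/matlab-7.2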
  Right now the cluster allows 22 simultaneous CPLEX processes and (separately) 20 simultaneous Matlab processes. Using the -l syntax above ensures that no more than this number of processes run. If you don't use the syntax, you run the risk of invoking more processes than there are licenses, causing some of your jobs to fail and thus requiring you to rerun jobs. The department as a whole has 100 Matlab licenses, of which many are in use at any time; thus, running a Matlab job without the -l syntax runs a high chance of failure. If there is a high level of Matlab usage in the department, it's also possible that you will run out of licenses even if you do use the -l flag, as this doesn't actually reserve the licenses for you. We have 22 CPLEX licenses, but they should only be used on this cluster; thus again you'll need to use -l but you shouldn't have to worry about users elsewhere in the dept. To check the number of Matlab licenses available you can type:

Line: 124 to 125
If you want to manually limit the number of processes that can run simultaneously to a different number, you can use -l max10=1, -l max20=1, -l max30=1 or -l max40=1. This will allow you to run 10, 20, 30 or 40 jobs at once. (Note that if multiple people running simultaneous jobs use the same flag, the number of simultaneous jobs will be limited across all users.)

Memory intensive jobs

Changed:
<
<
If your job may require significant amounts of memory, please use the consumable memheavy. You use this just like the Matlab or CPLEX consumables, i.e. -l memheavy=1.
>
>
If your job may require significant amounts of memory, please use the consumable memheavy. You use this just like the Matlab or CPLEX consumables, i.e. -l memheavy=1.
 Only one job using this consumable will be scheduled on each machine.

Manually limiting the number of your jobs

Revision 17 (2007-04-16) - FrankHutter

Line: 1 to 1
 

Sun Grid Engine - quick user guide

Line: 123 to 123
If you want to manually limit the number of processes that can run simultaneously to a different number, you can use -l max10=1, -l max20=1, -l max30=1 or -l max40=1. This will allow you to run 10, 20, 30 or 40 jobs at once. (Note that if multiple people running simultaneous jobs use the same flag, the number of simultaneous jobs will be limited across all users.)
Added:
>
>

Memory intensive jobs

If your job may require significant amounts of memory, please use the consumable memheavy. You use this just like the Matlab or CPLEX consumables, i.e. -l memheavy=1. Only one job using this consumable will be scheduled on each machine.

Manually limiting the number of your jobs

By using one of the consumables max10, max20, max30, or max40, you can manually limit the number of jobs you run at once. This is useful if you have long jobs and are worried about blocking the whole cluster with your jobs. As usual for consumables, use e.g. -l max10=1. Note that you will compete with other users requesting the same consumable (only 10 max10 jobs can run, regardless of who submits them).
 

What is the configuration of the machines, and how busy are they?

See http://samos.cs.ubc.ca/ganglia/ for machine load. To get the configuration of a machine, click on the appropriate cluster, and then on the machine of interest.

Revision 16 (2007-03-19) - FrankHutter

Line: 1 to 1
 

Sun Grid Engine - quick user guide

Line: 123 to 123
If you want to manually limit the number of processes that can run simultaneously to a different number, you can use -l max10=1, -l max20=1, -l max30=1 or -l max40=1. This will allow you to run 10, 20, 30 or 40 jobs at once. (Note that if multiple people running simultaneous jobs use the same flag, the number of simultaneous jobs will be limited across all users.)
Added:
>
>

What is the configuration of the machines, and how busy are they?

See http://samos.cs.ubc.ca/ganglia/ for machine load. To get the configuration of a machine, click on the appropriate cluster, and then on the machine of interest.
 

Where do Arrow jobs run?

Right here:

Revision 15 (2007-03-09) - FrankHutter

Line: 1 to 1
 

Sun Grid Engine - quick user guide

Line: 24 to 24
Jobs running on the arrow cluster belong to one of several priority classes. Jobs are scheduled (selected to be run) pre-emptively by priority class, and then evenly among users within a priority class. (Note that scheduling among users is done on the basis of CPU usage, not on the basis of the number of jobs submitted. Thus a user who submits many fast jobs will be scheduled more often than a user at the same priority class who submits many slow jobs.) Because of the preemptive scheduling, users submitting to a lower priority class may see high latency (it may take days or weeks before a queued job is scheduled). On the other hand, these lower-priority jobs will be allocated all 100 CPUs when no higher-priority jobs are waiting. All users should feel free to submit as many jobs as they like (but please use a few big array jobs rather than many single jobs), as doing so will not interfere with the cluster's ability to serve its primary purpose.

Priority classes are:

Changed:
<
<
  1. Urgent: intended for very occasional use, mostly around paper deadlines. The cluster is limited to 10 such jobs at any time.
>
>
  1. Urgent: intended for very occasional use, mostly around paper deadlines. The cluster is limited to 10 such jobs at any time. Please use the urgent consumable (see 'priority classes' below).
 
  1. eh (for empirical hardness): jobs which pertain to the particular project mentioned in the CFI grant under which the cluster was funded.
  2. ea (for empirical algorithmics): studies on the empirical properties of algorithms (i.e., "the 'E' in BETA"), but not part of the project described above.
  3. general: jobs which do not fall into one of the above categories. We ask that these jobs be relatively short in duration (although there can be arbitrarily many of them): since new jobs can only be scheduled when a processor becomes idle, excessively long low-priority jobs can lead to starvation of high-priority jobs.
  4. low: jobs of particularly low priority. This priority class can be used to submit a huge number of jobs that will give way to any other jobs in the queue if there are any.
Added:
>
>
  1. kpm: jobs of Kevin Murphy's students.
  2. kpmUrgent: urgent jobs of Kevin Murphy's students. Please use the kpmUrgent consumable (see 'priority classes' below).
 In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Frank Hutter, Lin Xu or Kevin Leyton-Brown.
Line: 92 to 94
 

Priority Classes

Changed:
<
<
If you use the qsub syntax above on arrow, your job will be assigned to the default priority class associated with your user account. This cannot be the priority class Urgent. To use another priority class than your default for a job, use the following syntax (note the capital P):
>
>
If you use the qsub syntax above on arrow, your job will be assigned to the default priority class associated with your user account. This cannot be the priority class Urgent or the class kpmUrgent. To use another priority class than your default for a job, use the following syntax (note the capital P):
 
   qsub -cwd -o <outfiledir> -e <errorfiledir> -P <priorityclass> helloworld.sh
   
Line: 100 to 102
 
   qsub -cwd -o <outfiledir> -e <errorfiledir> -P Urgent -l urgent=1 helloworld.sh
   
Added:
>
>
To submit in the kpmUrgent class, please use:
   qsub -cwd -o <outfiledir> -e <errorfiledir> -P kpmUrgent -l kpmUrgent=1 helloworld.sh
   
  As mentioned above, the total number of urgent jobs is limited to 10. This is even true if those are the only jobs on the cluster - in that case (which probably will never happen because many users will have stuff waiting) you can still fill the rest of the cluster with "normal" jobs.

CPLEX and MATLAB license management

Revision 14 (2006-11-03) - KevinLeytonBrown

Line: 1 to 1
 

Sun Grid Engine - quick user guide

Line: 123 to 123
 
Added:
>
>

Arrow users: CFI Blurbs

The Arrow cluster was funded under a CFI grant. This grant requires us to complete annual reports explaining how this infrastructure is being used. If you use the cluster for a project, large or small, please enter a bullet item here that gives a short description of your project and the role the cluster played.

  • Sample project: description (students/faculty involved)
 

Administration

For details on how to administer the cluster (requires admin access), look at SunGridEngineAdmin.

Revision 13 (2006-10-31) - KevinLeytonBrown

Line: 1 to 1
 

Sun Grid Engine - quick user guide

Line: 56 to 56
on the command line. This will create a new job with an automatically assigned job number <jobnumber> that is queued and eventually run on a machine in the cluster. When the job runs, it will write output (stdout) to the file <outfiledir>/helloworld.sh.o<jobnumber>. It will also create a file <errorfiledir>/helloworld.sh.e<jobnumber> and write stderr to that file. In the above case, "Hello world." will be written to the outfile and the errorfile will be empty. If you don't want output files (for array jobs you can easily end up with thousands) specify /dev/null as <>, and similarly for error files.
Added:
>
>
If you just want to see if the cluster works for you, use the following command. When the job finishes, an output file will be written to the current directory.
   qsub -cwd -o . -e . ~kevinlb/World/helloWorld.sh
   
 

What is an appropriate job size?

  • Short Jobs Please!: To the extent possible, submit short jobs.
Line: 104 to 109
  matlab -nojvm -nodisplay -nosplash < inputfile.m
Changed:
<
<
Right now the cluster allows 22 simultaneous CPLEX processes and (separately) 10 simultaneous Matlab processes. Using the -l syntax above ensures that no more than this number of processes run. If you don't use the syntax, you run the risk of invoking more processes than there are licenses, causing some of your jobs to fail and thus requiring you to rerun jobs. The department as a whole has 100 Matlab licenses, of which many are in use at any time; thus, running a Matlab job without the -l syntax runs a high chance of failure. We have 22 CPLEX licenses, but they should only be used on this cluster; thus again you'll need to use -l but you shouldn't have to worry about users elsewhere in the dept. To check the number of Matlab licenses available you can type:
>
>
Right now the cluster allows 22 simultaneous CPLEX processes and (separately) 20 simultaneous Matlab processes. Using the -l syntax above ensures that no more than this number of processes run. If you don't use the syntax, you run the risk of invoking more processes than there are licenses, causing some of your jobs to fail and thus requiring you to rerun jobs. The department as a whole has 100 Matlab licenses, of which many are in use at any time; thus, running a Matlab job without the -l syntax runs a high chance of failure. If there is a high level of Matlab usage in the department, it's also possible that you will run out of licenses even if you do use the -l flag, as this doesn't actually reserve the licenses for you. We have 22 CPLEX licenses, but they should only be used on this cluster; thus again you'll need to use -l but you shouldn't have to worry about users elsewhere in the dept. To check the number of Matlab licenses available you can type:
 
   /cs/local/generic/lib/pkg/matlab-7.2/etc/lmstat -a
   
To change the number of available Matlab licenses when using -l matlab=1, an administrator has to change the cluster configuration; details on how to do this are on the administration page.
Added:
>
>
If you want to manually limit the number of processes that can run simultaneously to a different number, you can use -l max10=1, -l max20=1, -l max30=1 or -l max40=1. This will allow you to run 10, 20, 30 or 40 jobs at once. (Note that if multiple people running simultaneous jobs use the same flag, the number of simultaneous jobs will be limited across all users.)
 

Where do Arrow jobs run?

Right here:

Revision 12 (2006-10-24) - jtillett

Line: 1 to 1
 

Sun Grid Engine - quick user guide

Line: 104 to 104
  matlab -nojvm -nodisplay -nosplash < inputfile.m
Changed:
<
<
Right now the cluster allows 22 simultaneous CPLEX processes and (separately) 10 simultaneous Matlab processes. Using the -l syntax above ensures that no more than this number of processes run. If you don't use the syntax, you run the risk of invoking more processes than there are licenses, causing some of your jobs to fail and thus requiring you to rerun jobs. The department as a whole has 100 CPLEX licenses, of which many are in use at any time; thus, running a Matlab job without the -l syntax runs a high chance of failure. We have 22 CPLEX licenses, but they should only be used on this cluster; thus again you'll need to use -l but you shouldn't have to worry about users elsewhere in the dept. To check the number of Matlab licenses available you can type:
>
>
Right now the cluster allows 22 simultaneous CPLEX processes and (separately) 10 simultaneous Matlab processes. Using the -l syntax above ensures that no more than this number of processes run. If you don't use the syntax, you run the risk of invoking more processes than there are licenses, causing some of your jobs to fail and thus requiring you to rerun jobs. The department as a whole has 100 Matlab licenses, of which many are in use at any time; thus, running a Matlab job without the -l syntax runs a high chance of failure. We have 22 CPLEX licenses, but they should only be used on this cluster; thus again you'll need to use -l but you shouldn't have to worry about users elsewhere in the dept. To check the number of Matlab licenses available you can type:
 
   /cs/local/generic/lib/pkg/matlab-7.2/etc/lmstat -a
   

Revision 11 (2006-10-13) - KevinLeytonBrown

Line: 1 to 1
 

Sun Grid Engine - quick user guide

Line: 32 to 32
 In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Frank Hutter, Lin Xu or Kevin Leyton-Brown.
Changed:
<
<

How to submit jobs

>
>

How do I submit jobs?

 
  • For the arrow cluster, add the line

Line: 56 to 56
on the command line. This will create a new job with an automatically assigned job number <jobnumber> that is queued and eventually run on a machine in the cluster. When the job runs, it will write output (stdout) to the file <outfiledir>/helloworld.sh.o<jobnumber>. It will also create a file <errorfiledir>/helloworld.sh.e<jobnumber> and write stderr to that file. In the above case, "Hello world." will be written to the outfile and the errorfile will be empty. If you don't want output files (for array jobs you can easily end up with thousands) specify /dev/null as <>, and similarly for error files.
Changed:
<
<

Tips on Submitting Jobs

>
>

What is an appropriate job size?

 
  • Short Jobs Please!: To the extent possible, submit short jobs. We have not configured SGE as load balancing software, because allowing jobs to be paused or migrated can affect the accuracy of process timing. Once a job is running it runs with 100% of a CPU until it's done. Thus, if you submit many long jobs, you will block the cluster for other users.
Changed:
<
<
Due to the share-based scheduling we've set up, your overall share of computational time will not be larger if you submit larger jobs (e.g., shell scripts that invoke more runs of your program). However, while longer jobs will not increase your throughput, they will increase latency for other users.
On the arrow cluster, jobs that run longer than 25 hours are automatically killed, but rather try to keep jobs to minutes or hours. This makes it easiest for SGE to help share resources in a fair way.
>
>
Due to the share-based scheduling we've set up, your overall share of computational time will not be larger if you submit larger jobs (e.g., shell scripts that invoke more runs of your program). However, while longer jobs will not increase your throughput, they will increase latency for other users.
  • Not ridiculously short jobs though!: jobs that are extremely short (e.g., 0.1 second) might take longer to be dispatched than they take to run, again hurting performance if a gigantic number of such jobs are in the queue. An ideal job length is on the order of a minute (i.e., if your jobs are extremely short, it's best to group them to achieve this sort of length). This would mean that higher-priority jobs would be able to access the cluster within about a second on average, but that job dispatching would not overwhelm the submit host.
  • Like it or not, you can't submit ridiculously long jobs: On the arrow cluster, jobs that run longer than 25 hours are automatically killed, but you should aim to keep jobs to minutes or hours anyway. This makes it easiest for SGE to help share resources in a fair way. See the instructions on array jobs for how to split up a really long job.
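As a concrete illustration of the grouping advice above, here is a minimal sketch; the program ./myprog and its --seed flag are hypothetical stand-ins for your own short-running command:

   #!/bin/bash
   # group-runs.sh: bundle 60 roughly one-second runs into one roughly one-minute job
   # ./myprog and --seed are hypothetical placeholders for your own program
   for seed in $(seq 1 60); do
       ./myprog --seed $seed
   done
   
You would then submit group-runs.sh with qsub exactly as shown above.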
 
Changed:
<
<
  • Priority Class: If you use the qsub syntax above on arrow, your job will be assigned to the default priority class associated with your user account. This cannot be the priority class Urgent. To use another priority class than your default for a job, use the following syntax (note the capital P):
       qsub -cwd -o <outfiledir> -e <errorfiledir> -P <priorityclass> helloworld.sh
       
    To submit in the urgent class, please use:
       qsub -cwd -o <outfiledir> -e <errorfiledir> -P Urgent -l urgent=1 helloworld.sh
       
    As mentioned above, the total number of urgent jobs is limited to 10. This is even true if those are the only jobs on the cluster - in that case (which probably will never happen because many users will have stuff waiting) you can still fill the rest of the cluster with "normal" jobs.
>
>

Array jobs

 
Changed:
<
<
  • CPLEX and MATLAB if your job uses CPLEX or Matlab, add -l cplex=1 or -l matlab=1 to your qsub command. This will ensure that we don't run more jobs than there are licenses. Instructions on how to use CPLEX are found here. We're not exactly sure of the best way to run MATLAB without invoking X-windows, but the best we've been able to do is
       matlab -nojvm -nodisplay -nosplash < inputfile.m
       
>
>
Array jobs are useful for submitting many similar jobs. In general, you should try to submit a single array job for each big run you're going to do, rather than, e.g., invoking qsub in a loop. This makes it easier to pause or delete your jobs, and also imposes less overhead on the scheduler. (In particular, the scheduler creates a directory for every job, so if you submit 10,000 jobs there will be 10,000 directories created before your job can start running. An array job containing 10,000 elements creates only a single directory.)
 
Deleted:
<
<
  • Array jobs are useful for submitting many similar jobs. In general, you should try to submit a single array job for each big run you're going to do, rather than (e.g.,) invoking qsub in a loop. This makes it easier to pause or delete your jobs, and also imposes less overhead on the scheduler.
  An example of an array job is the one-line script
   echo 'Hello world, number ' $SGE_TASK_ID
Line: 91 to 79
  on the command line, where the range 1-100 is chosen arbitrarily here. This will create a new array job, with an automatically assigned job number <jobnumber> and 100 entries, that is queued. Each entry of the array job will eventually run on a machine in the cluster - the <i>th entry will be called <jobnumber>.<i>. Sun Grid Engine treats every entry of an array job as a single job, and when the <i>th entry runs, it assigns <i> to the variable $SGE_TASK_ID. You may use this variable to do arbitrarily complex things in your shell script - an easy option is to index into a file and execute its <i>th line in the <i>th job, as in the sketch below.
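A sketch of that file-indexing trick; the file commands.txt, holding one shell command per line, is a hypothetical name:

   #!/bin/bash
   # run-line.sh: execute the line of commands.txt selected by this entry's task ID
   CMD=$(sed -n "${SGE_TASK_ID}p" commands.txt)
   eval "$CMD"
   
Submitting this with qsub -t 1-100 run-line.sh would run the first 100 lines of commands.txt, one line per array-job entry.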
Deleted:
<
<
 

How to monitor, control, and delete jobs

  • The command qstat is used to check the status of the queue. It lists all running and pending jobs. There is a separate entry for each running element of an array job, whereas the pending elements of an array job are collapsed into a single line. qstat -f gives detailed information for each cluster node, and qstat -ext gives more detailed information for each job. Try man qstat for more options.
  • The command qmon can be used to get a graphical interface to monitor and control jobs. It's not great, though.
  • The command qdel can be used to delete (your own) jobs. (syntax: qdel <jobnumber>). You can also delete single entries of array jobs (syntax qdel <jobnumber>.<i>).
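A hypothetical session tying these commands together (the job number 12345 is made up):

   qstat -u $USER    # list only your own running and pending jobs
   qdel 12345        # delete job 12345 entirely
   qdel 12345.7      # delete only entry 7 of array job 12345
   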
Added:
>
>

Priority Classes

If you use the qsub syntax above on arrow, your job will be assigned to the default priority class associated with your user account. This cannot be the priority class Urgent. To use a priority class other than your default for a job, use the following syntax (note the capital P):

   qsub -cwd -o <outfiledir> -e <errorfiledir> -P <priorityclass> helloworld.sh
   
To submit in the urgent class, please use:
   qsub -cwd -o <outfiledir> -e <errorfiledir> -P Urgent -l urgent=1 helloworld.sh
   
As mentioned above, the total number of urgent jobs is limited to 10. This holds even if those are the only jobs on the cluster - in that case (which will probably never happen, because many users will have jobs waiting) you can still fill the rest of the cluster with "normal" jobs.

CPLEX and MATLAB license management

If your job uses CPLEX or Matlab, add -l cplex=1 or -l matlab=1 to your qsub command. This will ensure that we don't run more jobs than there are licenses. Instructions on how to use CPLEX are found here. We're not exactly sure of the best way to run MATLAB without invoking X-windows, but the best we've been able to do is

   matlab -nojvm -nodisplay -nosplash < inputfile.m
   
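Putting the pieces together, a complete Matlab submission might look as follows; the script name matlabjob.sh and the input inputfile.m are placeholders:

   # contents of matlabjob.sh (a hypothetical wrapper script)
   matlab -nojvm -nodisplay -nosplash < inputfile.m
   
and then, reserving a license at submission time:

   qsub -cwd -o <outfiledir> -e <errorfiledir> -l matlab=1 matlabjob.sh
   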

Right now the cluster allows 22 simultaneous CPLEX processes and (separately) 10 simultaneous Matlab processes. Using the -l syntax above ensures that no more than this number of processes run. If you don't use the syntax, you risk invoking more processes than there are licenses, causing some of your jobs to fail and thus requiring you to rerun them. The department as a whole has 100 Matlab licenses, of which many are in use at any time; thus, a Matlab job run without the -l syntax has a high chance of failure. We have 22 CPLEX licenses, but they should only be used on this cluster; so again you'll need to use -l, but you shouldn't have to worry about users elsewhere in the department. To check the number of Matlab licenses available, you can type:

   /cs/local/generic/lib/pkg/matlab-7.2/etc/lmstat -a
   
To change the number of available Matlab licenses when using -l matlab=1, an administrator has to change the cluster configuration; details on how to do this are on the administration page.
 

Where do Arrow jobs run?

Right here:

Revision 10 - 2006-10-13 - KevinLeytonBrown

Line: 1 to 1
 

Sun Grid Engine - quick user guide

Line: 75 to 75
  As mentioned above, the total number of urgent jobs is limited to 10. This is even true if those are the only jobs on the cluster - in that case (which probably will never happen because many users will have stuff waiting) you can still fill the rest of the cluster with "normal" jobs.
Changed:
<
<
  • CPLEX and MATLAB if your job uses CPLEX or Matlab, add -l cplex=1 or -l matlab=1 to your qsub command. This will ensure that we don't run more jobs than there are licenses. Instructions on how to use CPLEX are found here.
>
>
  • CPLEX and MATLAB if your job uses CPLEX or Matlab, add -l cplex=1 or -l matlab=1 to your qsub command. This will ensure that we don't run more jobs than there are licenses. Instructions on how to use CPLEX are found here. We're not exactly sure of the best way to run MATLAB without invoking X-windows, but the best we've been able to do is
       matlab -nojvm -nodisplay -nosplash < inputfile.m
       
 
  • Array jobs are useful for submitting many similar jobs. In general, you should try to submit a single array job for each big run you're going to do, rather than (e.g.,) invoking qsub in a loop. This makes it easier to pause or delete your jobs, and also imposes less overhead on the scheduler.
    An example of an array job is the one-line script

Revision 9 - 2006-10-12 - KevinLeytonBrown

Line: 1 to 1
 

SunGridEngine - quick user guide

Changed:
<
<

Introduction:

>
>

Introduction

  This page gives a quick overview of computational facilities available to users in BETA and LCI, and explains how to use them with the SunGridEngine scheduling software.
An extensive overview of all the features of SGE can be found at the Sun website.
Changed:
<
<

Available clusters:

>
>

Available clusters

 
  • beta cluster: 5 machines, two 2GHz CPUs each, 4GB memory, running Linux. Available for members of the beta lab.
    This should probably be used only via SGE (with a share-based scheduling system that will actually work, as opposed to the current first-come-first-serve scheme)
Line: 17 to 19
  Details about the machines, their configuration, and their names: Ganglia
Changed:
<
<

The Arrow cluster:

>
>

The Arrow cluster

  Jobs running on the arrow cluster belong to one of four priority classes. Jobs are scheduled (selected to be run) pre-emptively by priority class, and then evenly among users within a priority class. (Note that scheduling among users is done on the basis of CPU usage, not on the basis of the number of jobs submitted. Thus a user who submits many fast jobs will be scheduled more often than a user at the same priority class who submits many slow jobs.) Because of the preemptive scheduling, users submitting to a lower priority class may see high latency (it may take days or weeks before a queued job is scheduled). On the other hand, these lower priority jobs will be allocated all 100 CPUs when no higher-priority jobs are waiting. All users should feel free to submit as many jobs as they like (but rather use few big array jobs than many single jobs), as doing so will not interfere with the cluster's ability to serve its primary purpose.
Line: 30 to 32
 In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Frank Hutter, Lin Xu or Kevin Leyton-Brown.
Changed:
<
<

How to submit jobs:

>
>

How to submit jobs

 
  • For the arrow cluster, add the line

Line: 87 to 89
  on the command line, where the range 1-100 is chosen arbitrarily here. This will create a new array job with an automatically assigned job number <jobnumber> and 100 entries that is queued. Each entry of the array job will eventually run on a machine in the cluster - the <i>th entry will be called <jobnumber>.<i>. Sungrid Engine treats every entry of an array job as a single job, and when the <i>th entry is called assigns <i> to the variable $SGE_TASK_ID. You may use this variable to do arbitrarily complex things in your shell script - an easy option is to index a file and execute the <i>th line with the <i>th job.
Changed:
<
<

How to monitor, control, and delete jobs:

>
>

How to monitor, control, and delete jobs

 
  • The command qstat is used to check the status of the queue. It lists all running and pending jobs. There is an entry for each entry of a running array job, whereas pending parts of array jobs are listed in one line. qstat -f gives detailed information for each cluster node, qstat -ext more detailed information for each job. Try man qstat for more options.
  • The command qmon can be used to get a graphical interface to monitor and control jobs. It's not great, though.
Line: 97 to 99
  Right here:
Changed:
<
<
>
>
 

Administration

Revision 8 - 2006-10-11 - FrankHutter

Line: 1 to 1
 

SunGridEngine - quick user guide

Introduction:

Line: 99 to 99
 
Added:
>
>

Administration

For details on how to administer the cluster (requires admin access), look at SunGridEngineAdmin.

 -- FrankHutter and Lin Xu - 02 May 2006

Revision 7 - 2006-08-04 - KevinLeytonBrown

Line: 1 to 1
 

SunGridEngine - quick user guide

Introduction:

Line: 93 to 93
 
  • The command qmon can be used to get a graphical interface to monitor and control jobs. It's not great, though.
  • The command qdel can be used to delete (your own) jobs. (syntax: qdel <jobnumber>). You can also delete single entries of array jobs (syntax qdel <jobnumber>.<i>).
Added:
>
>

Where do Arrow jobs run?

Right here:

 -- FrankHutter and Lin Xu - 02 May 2006

Revision 6 - 2006-05-25 - KevinLeytonBrown

Line: 1 to 1
 

SunGridEngine - quick user guide

Introduction:

Line: 73 to 73
  As mentioned above, the total number of urgent jobs is limited to 10. This is even true if those are the only jobs on the cluster - in that case (which probably will never happen because many users will have stuff waiting) you can still fill the rest of the cluster with "normal" jobs.
Changed:
<
<
  • CPLEX and MATLAB if your job uses CPLEX or Matlab, add -l cplex=1 or -l matlab=1 to your qsub command. This will ensure that we don't run more jobs than there are licenses.
>
>
  • CPLEX and MATLAB if your job uses CPLEX or Matlab, add -l cplex=1 or -l matlab=1 to your qsub command. This will ensure that we don't run more jobs than there are licenses. Instructions on how to use CPLEX are found here.
 
  • Array jobs are useful for submitting many similar jobs. In general, you should try to submit a single array job for each big run you're going to do, rather than (e.g.,) invoking qsub in a loop. This makes it easier to pause or delete your jobs, and also imposes less overhead on the scheduler.
    An example of an array job is the one-line script

Revision 5 - 2006-05-11 - FrankHutter

Line: 1 to 1
 

SunGridEngine - quick user guide

Introduction:

Line: 26 to 26
 
  1. eh (for empirical hardness): jobs which pertain to the particular project mentioned in the CFI grant under which the cluster was funded.
  2. ea (for empirical algorithmics): studies on the empirical properties of algorithms (i.e., "the 'E' in BETA"), but not part of the project described above.
  3. general: jobs which do not fall into one of the above categories. We ask that these jobs be relatively short in duration (although there can be arbitrarily many of them): since new jobs can only be scheduled when a processor becomes idle, excessively long low-priority jobs can lead to starvation of high-priority jobs.
Changed:
<
<
>
>
  1. low: jobs of particularly low priority. This priority class can be used to submit a huge amount of jobs that will give way to any other jobs in the queue if there are any.
 In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Frank Hutter, Lin Xu or Kevin Leyton-Brown.

Revision 4 - 2006-05-03 - FrankHutter

Line: 1 to 1
 

SunGridEngine - quick user guide

Introduction:

Line: 19 to 19
 

The Arrow cluster:

Changed:
<
<
Jobs running on the arrow cluster belong to one of four priority classes. Jobs are scheduled (selected to be run) pre-emptively by priority class, and then evenly among users within a priority class. (Note that scheduling among users is done on the basis of CPU usage, not on the basis of the number of jobs submitted. Thus a user who submits many fast jobs will be scheduled more often than a user at the same priority class who submits many slow jobs.) Because of the preemptive scheduling, users submitting to a lower priority class may see high latency (it may take hours or days before a queued job is scheduled). On the other hand, these lower priority jobs will be allocated all 100 CPUs when no higher-priority jobs are waiting. All users should feel free to submit as many jobs as they like, as doing so will not interfere with the cluster's ability to serve its primary purpose.
>
>
Jobs running on the arrow cluster belong to one of four priority classes. Jobs are scheduled (selected to be run) pre-emptively by priority class, and then evenly among users within a priority class. (Note that scheduling among users is done on the basis of CPU usage, not on the basis of the number of jobs submitted. Thus a user who submits many fast jobs will be scheduled more often than a user at the same priority class who submits many slow jobs.) Because of the preemptive scheduling, users submitting to a lower priority class may see high latency (it may take days or weeks before a queued job is scheduled). On the other hand, these lower priority jobs will be allocated all 100 CPUs when no higher-priority jobs are waiting. All users should feel free to submit as many jobs as they like (but rather use few big array jobs than many single jobs), as doing so will not interfere with the cluster's ability to serve its primary purpose.
 
Changed:
<
<
The priority classes are:
>
>
Priority classes are:
 
  1. Urgent: intended for very occasional use, mostly around paper deadlines. The cluster is limited to 10 such jobs at any time.
Changed:
<
<
  1. EmpiricalHardness: jobs which pertain to the particular project mentioned in the CFI grant under which the cluster was funded.
  2. EmpiricalAlgorithmics: studies on the empirical properties of algorithms (i.e., "the 'E' in BETA"), but not part of the project described above.
>
>
  1. eh (for empirical hardness): jobs which pertain to the particular project mentioned in the CFI grant under which the cluster was funded.
  2. ea (for empirical algorithmics): studies on the empirical properties of algorithms (i.e., "the 'E' in BETA"), but not part of the project described above.
 
  1. general: jobs which do not fall into one of the above categories. We ask that these jobs be relatively short in duration (although there can be arbitrarily many of them): since new jobs can only be scheduled when a processor becomes idle, excessively long low-priority jobs can lead to starvation of high-priority jobs.

In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Frank Hutter, Lin Xu or Kevin Leyton-Brown.

Line: 52 to 52
  qsub -cwd -o -e helloworld.sh on the command line. This will create a new job with an automatically assigned job number <jobnumber> that is queued and eventually run on a machine in the cluster.
Changed:
<
<
When the job runs, it will write output (stdout) to the file <outfiledir>/helloworld.sh.o<jobnumber> It will also create a file <errorfiledir>/helloworld.sh.e<jobnumber> and write stderr to that file. In the above case, "Hello world." will be written to the outfile and the errorfile will be empty. If you don't want output files (you can easily end up with thousands) specify /dev/null as
>
>
When the job runs, it will write output (stdout) to the file <outfiledir>/helloworld.sh.o<jobnumber> It will also create a file <errorfiledir>/helloworld.sh.e<jobnumber> and write stderr to that file. In the above case, "Hello world." will be written to the outfile and the errorfile will be empty. If you don't want output files (for array jobs you can easily end up with thousands) specify /dev/null as <>, and similarly for error files.
 

Tips on Submitting Jobs

Line: 63 to 63
  Due to the share-based scheduling we've set up, your overall share of computational time will not be larger if you submit larger jobs (e.g., shell scripts that invoke more runs of your program). However, while longer jobs will not increase your throughput, they will increase latency for other users.
On the arrow cluster, jobs that run longer than 25 hours are automatically killed, but rather try to keep jobs to minutes or hours. This makes it easiest for SGE to help share resources in a fair way.
Changed:
<
<
  • Priority Class: If you use the qsub syntax above on arrow, your job will be assigned to the general priority class. To use another priority class, use the following syntax (note the capital P):
>
>
  • Priority Class: If you use the qsub syntax above on arrow, your job will be assigned to the default priority class associated with your user account. This cannot be the priority class Urgent. To use another priority class than your default for a job, use the following syntax (note the capital P):
 
   qsub -cwd -o <outfiledir> -e <errorfiledir> -P <priorityclass> helloworld.sh
   

Revision 3 - 2006-05-03 - KevinLeytonBrown

Line: 1 to 1
 

SunGridEngine - quick user guide

Changed:
<
<
Introduction:
>
>

Introduction:

 
Changed:
<
<
This page gives a quick overview of the available computational facilities and how to use them with the SunGridEngine scheduling software.
>
>
This page gives a quick overview of computational facilities available to users in BETA and LCI, and explains how to use them with the SunGridEngine scheduling software.
 An extensive overview of all the features of SGE can be found at the Sun website.
Changed:
<
<
General policies:
>
>

Available clusters:

 
Changed:
<
<
  • Submit short jobs.
    SGE is not a load balancing software. Once a job is running it runs with 100% of a CPU until it's done.
    Thus, if you submit many long jobs, you will block the cluster for other users. Do not do that.
    Due to the share-based scheduling we've set up, your overall share of computational time will not be larger if you submit larger jobs.
    On the arrow cluster, jobs that run longer than 25 hours are automatically killed, but rather try to keep jobs to minutes or hours. This makes it easiest for SGE to help share resources in a fair way.

Available clusters:

  • beta cluster: 5 machines, two 2GHz CPUs each, 4GB memory, running Linux. Available for members of the beta lab.
>
>
  • beta cluster: 5 machines, two 2GHz CPUs each, 4GB memory, running Linux. Available for members of the beta lab.
  This should probably be used only via SGE (with a share-based scheduling system that will actually work, as opposed to the current first-come-first-serve scheme)
Changed:
<
<
  • icics cluster: 13 machines, two 3GHz CPUs each, 2GB memory, running Linux. Available for all members of the department.
>
>
  • icics cluster: 13 machines, two 3GHz CPUs each, 2GB memory, running Linux. Available for all members of the department.
  Some people run stuff locally on these machines. We could still use SGE on top of that (it dispatches jobs based on load), but there is no guarantee to get 100% CPU time on the nodes you're running on.
Changed:
<
<
  • arrow cluster: 50 machines, two 3.2 GHz CPUs each, 2GB memory, running Linux.
    This cluster is tied to Kevin Leyton-Brown's CFI grant for research on empirical hardness models. Because of this, jobs of this kind are required to have higher priority than other jobs. However, when no such jobs are running there are 100 CPUs up for grabs.
    This cluster is thus appropriate to run many (small) jobs that do not require a quick turn-around. When submitting to this cluster you have to expect to wait for weeks until your jobs run - but then you probably get the whole cluster for a couple of days.
    This cluster is not appropriate for jobs that require low latency. For special requests, such as very near paper deadlines, we set up a temporary urgent passing lane, jobs in which will be run first. To prevent urgent jobs from blocking the whole cluster, their total number is limited to 10. For details on this urgent lane, see UrgentJobsOnArrow. For comments on this usage policy, please contact Kevin Leyton-Brown.
>
>
  • arrow cluster: 50 machines, two 3.2 GHz CPUs each, 2GB memory, running Linux.
    This cluster is tied to Kevin Leyton-Brown's CFI grant for research on empirical hardness models, but is also available to other users in the department when it is idle.
  Details about the machines, their configuration, and their names: Ganglia
Changed:
<
<
How to submit jobs:
>
>

The Arrow cluster:

Jobs running on the arrow cluster belong to one of four priority classes. Jobs are scheduled (selected to be run) pre-emptively by priority class, and then evenly among users within a priority class. (Note that scheduling among users is done on the basis of CPU usage, not on the basis of the number of jobs submitted. Thus a user who submits many fast jobs will be scheduled more often than a user at the same priority class who submits many slow jobs.) Because of the preemptive scheduling, users submitting to a lower priority class may see high latency (it may take hours or days before a queued job is scheduled). On the other hand, these lower priority jobs will be allocated all 100 CPUs when no higher-priority jobs are waiting. All users should feel free to submit as many jobs as they like, as doing so will not interfere with the cluster's ability to serve its primary purpose.

The priority classes are:

  1. Urgent: intended for very occasional use, mostly around paper deadlines. The cluster is limited to 10 such jobs at any time.
  2. EmpiricalHardness: jobs which pertain to the particular project mentioned in the CFI grant under which the cluster was funded.
  3. EmpiricalAlgorithmics: studies on the empirical properties of algorithms (i.e., "the 'E' in BETA"), but not part of the project described above.
  4. general: jobs which do not fall into one of the above categories. We ask that these jobs be relatively short in duration (although there can be arbitrarily many of them): since new jobs can only be scheduled when a processor becomes idle, excessively long low-priority jobs can lead to starvation of high-priority jobs.

In order to submit in any priority class (even 'general'), access for that class must be explicitly granted to your user account. To request access, please contact Frank Hutter, Lin Xu or Kevin Leyton-Brown.

How to submit jobs:

 
  • For the arrow cluster, add the line

Line: 38 to 41
  source /cs/beta/lib/pkg/sge/beta_grid/common/settings.csh but we may completely get rid of this configuration and have it all in one. Currently, you work solely with the one cluster that is indicated by this line in the configuration file - there is no easy way to go back and forth (but this will hopefully change).
Changed:
<
<
  • ssh onto a submit host. For beta, submit hosts are ?, for arrow it is samos (which is now another name for arrow)
>
>
  • ssh onto a submit host. For beta, submit hosts are (we're not sure), for arrow it is samos (which is now another name for arrow)
 
  • A job is submitted to the cluster in the form of a shell (.sh) script.
  • You can either submit single or array jobs. An example for a single job would e.g. be the one-line script

Line: 49 to 52
  qsub -cwd -o -e helloworld.sh on the command line. This will create a new job with an automatically assigned job number <jobnumber> that is queued and eventually run on a machine in the cluster.
Changed:
<
<
When the job runs, it will write output (stdout) to the file <outfiledir>/helloworld.sh.o<jobnumber> It will also create a file <errorfiledir>/helloworld.sh.e<jobnumber> and write stderr to that file. In the above case, "Hello world." will be written to the outfile and the errorfile will be empty.
  • Array jobs are useful to submit many similar jobs. In general, you should prefer array jobs over multiple single jobs.
>
>
When the job runs, it will write output (stdout) to the file <outfiledir>/helloworld.sh.o<jobnumber> It will also create a file <errorfiledir>/helloworld.sh.e<jobnumber> and write stderr to that file. In the above case, "Hello world." will be written to the outfile and the errorfile will be empty. If you don't want output files (you can easily end up with thousands) specify /dev/null as

Tips on Submitting Jobs

  • Short Jobs Please!: To the extent possible, submit short jobs. We have not configured SGE as load balancing software, because allowing jobs to be paused or migrated can affect the accuracy of process timing. Once a job is running it runs with 100% of a CPU until it's done. Thus, if you submit many long jobs, you will block the cluster for other users. Due to the share-based scheduling we've set up, your overall share of computational time will not be larger if you submit larger jobs (e.g., shell scripts that invoke more runs of your program). However, while longer jobs will not increase your throughput, they will increase latency for other users.
    On the arrow cluster, jobs that run longer than 25 hours are automatically killed, but rather try to keep jobs to minutes or hours. This makes it easiest for SGE to help share resources in a fair way.

  • Priority Class: If you use the qsub syntax above on arrow, your job will be assigned to the general priority class. To use another priority class, use the following syntax (note the capital P):
       qsub -cwd -o <outfiledir> -e <errorfiledir> -P <priorityclass> helloworld.sh
       
    To submit in the urgent class, please use:
       qsub -cwd -o <outfiledir> -e <errorfiledir> -P Urgent -l urgent=1 helloworld.sh
       
    As mentioned above, the total number of urgent jobs is limited to 10. This is even true if those are the only jobs on the cluster - in that case (which probably will never happen because many users will have stuff waiting) you can still fill the rest of the cluster with "normal" jobs.

  • CPLEX and MATLAB if your job uses CPLEX or Matlab, add -l cplex=1 or -l matlab=1 to your qsub command. This will ensure that we don't run more jobs than there are licenses.

  • Array jobs are useful for submitting many similar jobs. In general, you should try to submit a single array job for each big run you're going to do, rather than (e.g.,) invoking qsub in a loop. This makes it easier to pause or delete your jobs, and also imposes less overhead on the scheduler.
  An example of an array job is the one-line script
   echo 'Hello world, number ' $SGE_TASK_ID
Line: 61 to 86
  on the command line, where the range 1-100 is chosen arbitrarily here. This will create a new array job with an automatically assigned job number <jobnumber> and 100 entries that is queued. Each entry of the array job will eventually run on a machine in the cluster - the <i>th entry will be called <jobnumber>.<i>. Sungrid Engine treats every entry of an array job as a single job, and when the <i>th entry is called assigns <i> to the variable $SGE_TASK_ID. You may use this variable to do arbitrarily complex things in your shell script - an easy option is to index a file and execute the <i>th line with the <i>th job.
Changed:
<
<
How to monitor, control, and delete jobs:
>
>

How to monitor, control, and delete jobs:

 
  • The command qstat is used to check the status of the queue. It lists all running and pending jobs. There is an entry for each entry of a running array job, whereas pending parts of array jobs are listed in one line. qstat -f gives detailed information for each cluster node, qstat -ext more detailed information for each job. Try man qstat for more options.
  • The command qmon can be used to get a graphical interface to monitor and control jobs. It's not great, though.
  • The command qdel can be used to delete (your own) jobs. (syntax: qdel <jobnumber>). You can also delete single entries of array jobs (syntax qdel <jobnumber>.<i>).
Deleted:
<
<
Urgent jobs on arrow If you have an urgent deadline and would like to use some CPUs with comparably low latency, please contact Kevin Leyton-Brown, Frank Hutter, or Lin Xu to be temporarily added as an urgent user. Once you are added as a temporary urgent user, you can submit jobs using qsub -P Urgent -l urgent=1 instead of qsub. As said above, the total number of urgent jobs is limited to 10. This is even true if those are the only jobs on the cluster - in that case (which probably will never happen because many users will have stuff waiting) you can still fill the rest of the cluster with "normal" jobs.
 -- FrankHutter and Lin Xu - 02 May 2006

Revision 2 - 2006-05-02 - FrankHutter

Line: 1 to 1
 

SunGridEngine - quick user guide

Introduction:

Changed:
<
<
This page gives a quick overview of the available computational facilities and how to use them with the SunGridEngine scheduling software.
>
>
This page gives a quick overview of the available computational facilities and how to use them with the SunGridEngine scheduling software.
An extensive overview of all the features of SGE can be found at the Sun website.
  General policies:
Line: 22 to 23
 
  • arrow cluster: 50 machines, two 3.2 GHz CPUs each, 2GB memory, running Linux.
    This cluster is tied to Kevin Leyton-Brown's CFI grant for research on empirical hardness models. Because of this, jobs of this kind are required to have higher priority than other jobs. However, when no such jobs are running there are 100 CPUs up for grabs.
    This cluster is thus appropriate to run many (small) jobs that do not require a quick turn-around. When submitting to this cluster you have to expect to wait for weeks until your jobs run - but then you probably get the whole cluster for a couple of days.
Changed:
<
<
This cluster is not appropriate for jobs that require low latency. For special requests, such as very near paper deadlines, we set up a temporary urgent passing lane jobs in which will be run first. To prevent urgent jobs from blocking the whole cluster, their total number is limited to 10.
>
>
This cluster is not appropriate for jobs that require low latency. For special requests, such as very near paper deadlines, we set up a temporary urgent passing lane, jobs in which will be run first. To prevent urgent jobs from blocking the whole cluster, their total number is limited to 10. For details on this urgent lane, see UrgentJobsOnArrow. For comments on this usage policy, please contact Kevin Leyton-Brown.
  Details about the machines, their configuration, and their names: Ganglia
Line: 38 to 39
  but we may completely get rid of this configuration and have it all in one. Currently, you work solely with the one cluster that is indicated by this line in the configuration file - there is no easy way to go back and forth (but this will hopefully change).
  • ssh onto a submit host. For beta, submit hosts are ?, for arrow it is samos (which is now another name for arrow)
Changed:
<
<
  • A job is submitted to the cluster in the form of a shell (.sh) script A simple script could e.g. be
>
>
  • A job is submitted to the cluster in the form of a shell (.sh) script.
  • You can either submit single or array jobs. An example for a single job would e.g. be the one-line script
 
   echo 'Hello world.'
   
Changed:
<
<
If this is the content of the file helloworld.sh, you submit it by typing
>
>
If this is the content of the file helloworld.sh, you can submit a job by typing
 
   qsub -cwd -o <outfiledir> -e <errorfiledir> helloworld.sh
   
Changed:
<
<
on the command line, where is the directory the job's output file (in this case containing "Hello world") is written to, and is the directory any error output is written to. The outfile will have the name helloworld.sh.o, the errorfile the name helloworld.sh.e (where the job number is automatically assigned by SGE). -- FrankHutter - 02 May 2006
>
>
on the command line. This will create a new job with an automatically assigned job number <jobnumber> that is queued and eventually run on a machine in the cluster. When the job runs, it will write output (stdout) to the file <outfiledir>/helloworld.sh.o<jobnumber> It will also create a file <errorfiledir>/helloworld.sh.e<jobnumber> and write stderr to that file. In the above case, "Hello world." will be written to the outfile and the errorfile will be empty.
  • Array jobs are useful to submit many similar jobs. In general, you should prefer array jobs over multiple single jobs.
    An example of an array job is the one-line script
       echo 'Hello world, number ' $SGE_TASK_ID
       
    If this is the content of the file many-helloworld.sh, you can submit an array job by typing
       qsub -cwd -o <outfiledir> -e <errorfiledir> -t 1-100 many-helloworld.sh
       
    on the command line, where the range 1-100 is chosen arbitrarily here. This will create a new array job with an automatically assigned job number <jobnumber> and 100 entries that is queued. Each entry of the array job will eventually run on a machine in the cluster - the <i>th entry will be called <jobnumber>.<i>. Sungrid Engine treats every entry of an array job as a single job, and when the <i>th entry is called assigns <i> to the variable $SGE_TASK_ID. You may use this variable to do arbitrarily complex things in your shell script - an easy option is to index a file and execute the <i>th line with the <i>th job.

How to monitor, control, and delete jobs:

  • The command qstat is used to check the status of the queue. It lists all running and pending jobs. There is an entry for each entry of a running array job, whereas pending parts of array jobs are listed in one line. qstat -f gives detailed information for each cluster node, qstat -ext more detailed information for each job. Try man qstat for more options.
  • The command qmon can be used to get a graphical interface to monitor and control jobs. It's not great, though.
  • The command qdel can be used to delete (your own) jobs. (syntax: qdel <jobnumber>). You can also delete single entries of array jobs (syntax qdel <jobnumber>.<i>).

Urgent jobs on arrow If you have an urgent deadline and would like to use some CPUs with comparably low latency, please contact Kevin Leyton-Brown, Frank Hutter, or Lin Xu to be temporarily added as an urgent user. Once you are added as a temporary urgent user, you can submit jobs using qsub -P Urgent -l urgent=1 instead of qsub. As said above, the total number of urgent jobs is limited to 10. This is even true if those are the only jobs on the cluster - in that case (which probably will never happen because many users will have stuff waiting) you can still fill the rest of the cluster with "normal" jobs.

-- FrankHutter and Lin Xu - 02 May 2006

 

Revision 1 - 2006-05-02 - FrankHutter

Line: 1 to 1
Added:
>
>

SunGridEngine - quick user guide

Introduction:

This page gives a quick overview of the available computational facilities and how to use them with the SunGridEngine scheduling software.

General policies:

  • Submit short jobs.
    SGE is not a load balancing software. Once a job is running it runs with 100% of a CPU until it's done.
    Thus, if you submit many long jobs, you will block the cluster for other users. Do not do that.
    Due to the share-based scheduling we've set up, your overall share of computational time will not be larger if you submit larger jobs.
    On the arrow cluster, jobs that run longer than 25 hours are automatically killed, but rather try to keep jobs to minutes or hours. This makes it easiest for SGE to help share resources in a fair way.

Available clusters:

  • beta cluster: 5 machines, two 2GHz CPUs each, 4GB memory, running Linux. Available for members of the beta lab.
    This should probably be used only via SGE (with a share-based scheduling system that will actually work, as opposed to the current first-come-first-serve scheme)
  • icics cluster: 13 machines, two 3GHz CPUs each, 2GB memory, running Linux. Available for all members of the department.
    Some people run stuff locally on these machines. We could still use SGE on top of that (it dispatches jobs based on load), but there is no guarantee to get 100% CPU time on the nodes you're running on.
  • arrow cluster: 50 machines, two 3.2 GHz CPUs each, 2GB memory, running Linux.
    This cluster is tied to Kevin Leyton-Brown's CFI grant for research on empirical hardness models. Because of this, jobs of this kind are required to have higher priority than other jobs. However, when no such jobs are running there are 100 CPUs up for grabs.
    This cluster is thus appropriate to run many (small) jobs that do not require a quick turn-around. When submitting to this cluster you have to expect to wait for weeks until your jobs run - but then you probably get the whole cluster for a couple of days.
    This cluster is not appropriate for jobs that require low latency. For special requests, such as very near paper deadlines, we set up a temporary urgent passing lane jobs in which will be run first. To prevent urgent jobs from blocking the whole cluster, their total number is limited to 10.

Details about the machines, their configuration, and their names: Ganglia

How to submit jobs:

  • For the arrow cluster, add the line
       source /cs/beta/lib/pkg/sge-6.0u7_1/default/common/settings.csh
       
    to your configuration file (e.g. ~/csh_init/.cshrc). For the beta cluster, the appropriate line to add is
       source /cs/beta/lib/pkg/sge/beta_grid/common/settings.csh
       
    but we may completely get rid of this configuration and have it all in one. Currently, you work solely with the one cluster that is indicated by this line in the configuration file - there is no easy way to go back and forth (but this will hopefully change).
  • ssh onto a submit host. For beta, submit hosts are ?, for arrow it is samos (which is now another name for arrow)
  • A job is submitted to the cluster in the form of a shell (.sh) script A simple script could e.g. be
       echo 'Hello world.'
       
    If this is the content of the file helloworld.sh, you submit it by typing
       qsub -cwd -o <outfiledir> -e <errorfiledir> helloworld.sh
       
    on the command line, where is the directory the job's output file (in this case containing "Hello world") is written to, and is the directory any error output is written to. The outfile will have the name helloworld.sh.o, the errorfile the name helloworld.sh.e (where the job number is automatically assigned by SGE).
-- FrankHutter - 02 May 2006
 