This page is part of the EmpiricalAlgorithmics web.
This page gives a quick overview of computational facilities available to users in BETA and LCI, and explains how to use them with the SunGridEngine scheduling software.
An extensive overview of all the features of SGE can be found at the Sun website.
Note: As of Jan-05-2011, samos & ganglia no longer appear to be available, and the submit host for arrow is arrowhead.cs.ubc.ca.
Details about the machines, their configuration, and their names: Ganglia
Jobs running on the arrow cluster belong to one of four priority classes. Jobs are scheduled (selected to be run) pre-emptively by priority class, and then evenly among users within a priority class. (Note that scheduling among users is done on the basis of CPU usage, not on the basis of the number of jobs submitted. Thus a user who submits many fast jobs will be scheduled more often than a user at the same priority class who submits many slow jobs.) Because of the preemptive scheduling, users submitting to a lower priority class may see high latency (it may take days or weeks before a queued job is scheduled). On the other hand, these lower priority jobs will be allocated all 100 CPUs when no higher-priority jobs are waiting. All users should feel free to submit as many jobs as they like (though prefer a few big array jobs over many single jobs), as doing so will not interfere with the cluster's ability to serve its primary purpose.
Note: much of the information on priority classes is out of date. This section will be updated soon. Priority classes are:
To use the arrow cluster, add

source /cs/sungridengine/arrow/common/settings.csh

to your configuration file (e.g. ~/csh_init/cshrc), or add

source /cs/beta/lib/pkg/sge-6.0u7_1/default/common/settings.csh

to access the old arrow cluster. For the beta cluster, simply type the following command in a shell:

source /cs/sungridengine/beta-icics/common/settings.sh
echo 'Hello world.'

If this is the content of the file helloworld.sh, you can submit a job by typing

qsub -cwd -o <outfiledir> -e <errorfiledir> helloworld.sh

on the command line. This will create a new job with an automatically assigned job number <jobnumber> that is queued and eventually run on a machine in the cluster. When the job runs, it will write output (stdout) to the file <outfiledir>/helloworld.sh.o<jobnumber>. It will also create a file <errorfiledir>/helloworld.sh.e<jobnumber> and write stderr to that file. In the above case, "Hello world." will be written to the outfile and the errorfile will be empty. If you don't want output files (for array jobs you can easily end up with thousands), specify /dev/null as <outfiledir> and <errorfiledir>.
qsub -cwd -o . -e . ~kevinlb/World/helloWorld.sh
qsub -cwd -o /dev/null -e /dev/null script.sh
qsub -cwd -o /dev/null -e /dev/null -q icics.q script.sh
To specify the shell your job runs under, either start your script with the two lines

#!/bin/sh
#$ -S /bin/sh

or give it only the first line #!/bin/sh, and call it like this:
qsub -cwd -S /bin/sh script.sh
Similarly, for a Perl script, start it with

#!/cs/local/bin/perl
#$ -S /cs/local/bin/perl

or you can add -S /cs/local/bin/perl to your qsub command and remove the second line from the script (the first line still needs to be there).
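Pulling the shell-script pieces together: qsub also reads options embedded in the script itself on lines starting with #$, so they need not be repeated on the command line each time. A minimal sketch (the -o/-e locations here are just example choices):

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
# Example output locations; any writable path (or /dev/null) works.
#$ -o .
#$ -e .
# The "#$" lines are ordinary comments to the shell but are read by
# qsub, so this script can be submitted with just: qsub helloworld.sh
echo 'Hello world.'
```

Running the script directly in a shell behaves exactly as if the #$ lines were absent, which is what makes this embedding safe.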
Array jobs are useful for submitting many similar jobs. In general, you should try to submit a single array job for each big run you're going to do, rather than (e.g.,) invoking qsub in a loop. This makes it easier to pause or delete your jobs, and also imposes less overhead on the scheduler. (In particular, the scheduler creates a directory for every job, so if you submit 10,000 jobs there will be 10,000 directories created before your job can start running. An array job containing 10,000 elements creates only a single directory.)
An example of an array job is the one-line script
echo 'Hello world, number ' $SGE_TASK_ID

If this is the content of the file many-helloworld.sh, you can submit an array job by typing

qsub -cwd -o <outfiledir> -e <errorfiledir> -t 1-100 many-helloworld.sh

on the command line, where the range 1-100 is chosen arbitrarily here. This will create a new array job with an automatically assigned job number <jobnumber> and 100 queued entries. Each entry of the array job will eventually run on a machine in the cluster; the <i>th entry will be called <jobnumber>.<i>. Sun Grid Engine treats every entry of an array job as a single job, and when the <i>th entry runs it assigns <i> to the variable $SGE_TASK_ID. You may use this variable to do arbitrarily complex things in your shell script; an easy option is to index a file and execute its <i>th line in the <i>th job.
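As a sketch of the file-indexing trick just mentioned (the file name commands.txt and its contents are made up for illustration), each array-job entry picks out and runs its own line of a command file:

```shell
#!/bin/sh
#$ -S /bin/sh
# Hypothetical command file: one shell command per line, so line i is
# executed by array-job entry i. Created inline here only to keep the
# sketch self-contained; on the cluster it would already exist.
printf "echo first\necho second\necho third\n" > commands.txt

# Under SGE this variable is set for you; we fake it for illustration.
SGE_TASK_ID=${SGE_TASK_ID:-2}

# Extract line number $SGE_TASK_ID from the file and run it.
CMD=$(sed -n "${SGE_TASK_ID}p" commands.txt)
eval "$CMD"
```

Submitted with -t 1-3, entry 1 would run "echo first", entry 2 "echo second", and so on.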
If you use the qsub syntax above on arrow, your job will be assigned to the default priority class associated with your user account. This can never be the priority class Urgent or the class kpmUrgent. To use a priority class other than your default for a job, use the following syntax (note the capital P):
qsub -cwd -o <outfiledir> -e <errorfiledir> -P <priorityclass> helloworld.sh

To submit in the Urgent class, please use:

qsub -cwd -o <outfiledir> -e <errorfiledir> -P Urgent -l urgent=1 helloworld.sh

To submit in the kpmUrgent class, please use:

qsub -cwd -o <outfiledir> -e <errorfiledir> -P kpmUrgent -l kpmUrgent=1 helloworld.sh

As mentioned above, the total number of urgent jobs is limited to 10. This is true even if those are the only jobs on the cluster; in that case (which probably will never happen, because many users will have jobs waiting) you can still fill the rest of the cluster with "normal" jobs.
If your job uses CPLEX or Matlab, add -l cplex=1 or -l matlab=1 to your qsub command. This will ensure that we don't run more jobs than there are licenses. Instructions on how to use CPLEX are found here. We're not exactly sure of the best way to run MATLAB without invoking X-windows, but the best we've been able to do is
matlab -nojvm -nodisplay -nosplash < inputfile.m

Another good approach is to use the Matlab compiler. The advantage of this is that compiled Matlab code does not require licenses to run.
Right now the cluster allows 22 simultaneous CPLEX processes and (separately) 20 simultaneous Matlab processes. Using the -l syntax above ensures that no more than this number of processes run. If you don't use the syntax, you run the risk of invoking more processes than there are licenses, causing some of your jobs to fail and thus requiring you to rerun jobs. The department as a whole has 100 Matlab licenses, of which many are in use at any time; thus, running a Matlab job without the -l syntax runs a high chance of failure. If there is a high level of Matlab usage in the department, it's also possible that you will run out of licenses even if you do use the -l flag, as this doesn't actually reserve the licenses for you. We have 22 CPLEX licenses, but they should only be used on this cluster; thus again you'll need to use -l but you shouldn't have to worry about users elsewhere in the dept. To check the number of Matlab licenses available you can type:
/cs/local/generic/lib/pkg/matlab-7.2/etc/lmstat -a

To change the number of available Matlab licenses when using -l matlab=1, an administrator has to change the cluster configuration; details on how to do this are on the administration page.
If you want to manually limit the number of processes that can run simultaneously to a different number, you can use -l max10=1, -l max20=1, -l max30=1 or -l max40=1. This will allow you to run 10, 20, 30 or 40 jobs at once. (Note that if multiple people running simultaneous jobs use the same flag, the number of simultaneous jobs will be limited across all users.)
There is also a consumable for memory-heavy jobs, memheavy. You use this just like the Matlab or CPLEX consumables, i.e. -l memheavy=1. Only one job using this consumable will be scheduled on each machine.
To claim a whole machine, there is a parallel environment called fillup. This will assign 2 (i.e., both) slots on the same machine to your job. You use it by specifying -pe fillup 2 in your submit command.
The -h option will hold your array of jobs in the queue until there are no other jobs running on the SGE. This isn't really recommended unless you want to be especially nice and make sure no one else is using the cluster when your jobs run.
The best option is -hold_jid job_identifier_list, which will hold your array of jobs in the queue until the specific job(s) named in job_identifier_list have finished. So, to manually limit the number of jobs you are running (let's say 10), split your array up into blocks of 10 and submit them one at a time with the -hold_jid jobID option, where jobID is the ID of the previous block of 10. This ensures that you never use more than 10 machines at a time, and it also lets you use any 10 machines.
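The block-of-10 scheme can be scripted. The sketch below is only illustrative (the script name job.sh and the 100-entry range are assumptions), and the id extraction relies on qsub's standard submission message of the form 'Your job-array 1234.1-10:1 ("job.sh") has been submitted':

```shell
#!/bin/sh
# Submit a 100-entry run as ten 10-entry array jobs, each held on the
# previous one, so at most 10 tasks are eligible to run at any time.
PREV=""
START=1
while [ "$START" -le 91 ]; do
  END=$((START + 9))
  if [ -z "$PREV" ]; then
    MSG=$(qsub -cwd -t ${START}-${END} job.sh)
  else
    MSG=$(qsub -cwd -t ${START}-${END} -hold_jid "$PREV" job.sh)
  fi
  # Field 3 of the message is e.g. "1234.1-10:1"; keep only the
  # numeric job id before the dot to hold the next block on.
  PREV=$(echo "$MSG" | awk '{print $3}' | cut -d. -f1)
  START=$((START + 10))
done
```

Each block only becomes eligible once the previous block's array job has finished completely.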
The Arrow cluster was funded under a CFI grant. This grant requires us to complete annual reports explaining how this infrastructure is being used. This information is summarized on the page ArrowClusterCFISummaries.
How to set up CVS.
For details on how to administer the cluster (requires admin access), look at SunGridEngineAdmin. For software installed on arrow (or software you need installed there), see ArrowSoftware.
-- FrankHutter and Lin Xu - 02 May 2006