
Sun Grid Engine Administration

This page is part of the EmpiricalAlgorithmics web.

For general information about using Sun Grid Engine, see SunGridEngine.

Checking the queue

qstat gives basic information about the jobs in the queue
qstat -ext gives a bit of extra information, such as a job's project
qstat -j <jobnumber> gives very detailed information about that job
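As a quick reference, the checks above look like this on the command line (the job number 12345 is just a placeholder):

```shell
# List all jobs currently in the queue (basic view)
qstat

# Extended view: adds columns such as each job's project
qstat -ext

# Detailed information about one specific job
# (12345 is a placeholder job number)
qstat -j 12345
```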

Is there a way to change the project of a running job? That could be useful someday...

Override Policy

This is the primary mechanism used to determine which jobs are dispatched. Go to Policy Configuration and then Override Policy. Choose "project" from the dropdown, and you'll see all the SGE project names with the number of override tickets each one gets. These should always be multiples of 10,000: this ensures that override tickets trump share tree tickets (of which there are 9,000, as set on the main policy configuration page). Higher-priority projects preempt lower-priority ones: as long as their tickets are multiples of 10,000, no jobs will be run from a lower-priority project while pending jobs from a higher-priority project exist. You can modify the number of tickets a project is given here, but you can't create a new project. To do that, go to "Project Configuration" from the main qmon dialog.
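The same settings can also be inspected and changed without qmon; a sketch using qconf, where the project name eh is just an example:

```shell
# List all configured projects
qconf -sprjl

# Show one project's configuration; the oticket field
# holds its override tickets (eh is an example name)
qconf -sprj eh

# Edit the project in $EDITOR; set oticket to a multiple
# of 10,000 so it trumps the 9,000 share tree tickets
qconf -mprj eh
```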

Share Tree Policy

This is the policy that is used to determine how competing jobs are scheduled when they fall within the same project. To change the share tree policy, go to Policy Configuration and click on Share Tree Policy. Right now, for each project (eh, ea, Urgent, etc.), there is a node in the graph with a leaf called default. (If you don't see the leaf, double-click on the node to open it up.) Under this default leaf, SGE automatically adds all users in that project--they're listed inside. This ensures that all users in the project get the same priority, so that SGE gives each user the same amount of CPU time (not the same number of jobs) within the same time window. Of course, the share tree policy doesn't have to share resources evenly. You can add another leaf named after a specific user to give them extra shares (priority is proportional to the entry for Shares).
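The share tree can also be viewed and edited from the command line; a sketch:

```shell
# Print the current share tree
qconf -sstree

# Edit the share tree in $EDITOR; each node is a block of
# id=, name=, type=, shares= and childnodes= lines
qconf -mstree
```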

Adding a new user

Go to User Configuration, click on Userset, select the appropriate Userset, click on Modify, and enter the username. Currently, the only user set maintained is eh; we may create others in the future.
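Usersets (access lists) can also be managed with qconf; a sketch, with newuser, eh, oldset and newset as placeholder names:

```shell
# Show all usersets (access lists)
qconf -sul

# Show the members of the eh userset
qconf -su eh

# Add a user to the eh userset (newuser is a placeholder)
qconf -au newuser eh

# To move a user between usersets, delete first, then add
qconf -du newuser oldset
qconf -au newuser newset
```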

Adding a new user set

Go to User Configuration, click on Userset, make sure Department is chosen in the lower left, and click on Add. Then add people as described above (if you want to transfer people from other user sets, you have to delete them from those first and then add them to the new one). You can then associate the new user set with the projects its users are eligible to submit to.

Adding a new project

Go to Project Configuration and click on Add. Enter the name of the project and choose user sets or users who are eligible to submit jobs to this project by clicking on the buttons below. E.g., say you want to add a user set: click on the left button, and in the new window that pops up, choose the applicable user sets.
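A command-line sketch of the same step; the project name newproj and the userset eh are placeholders:

```shell
# Create a new project: opens $EDITOR on a template whose
# acl field lists the usersets allowed to submit, e.g.:
#   name    newproj
#   oticket 0
#   fshare  0
#   acl     eh
#   xacl    NONE
qconf -aprj

# Afterwards, verify the result
qconf -sprj newproj
```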

Parallel environments

A parallel environment defines a schema for how multiple-CPU jobs are to be run. Run a job in a parallel environment by adding "-pe <pe_name> <slots>" to the qsub command. For example, add "-pe fillup 2" to run a job which reserves 2 slots on the same host for job execution. The "fillup" environment was created so that multi-threaded, CPU-intensive jobs do not "clobber" other jobs placed on the same host. More complex parallel environments are likely required for MPI, etc., jobs, but no such environments have been configured yet.

Parallel environments can be created using "qconf -ap <pe_name>", modified with "qconf -mp <pe_name>", and listed with "qconf -spl". A new parallel environment must have its name added to the "pe_list" variable of some queue ("qconf -mq <queue>") before it is usable. More information can be found at http://wikis.sun.com/display/gridengine62u2/Managing+Parallel+Environments
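Putting the pieces together, a sketch of inspecting, attaching and using the fillup environment (myjob.sh is a placeholder script):

```shell
# List all parallel environments
qconf -spl

# Show the fillup environment's configuration
qconf -sp fillup

# Attach the PE to a queue without opening an editor:
# add fillup to all.q's pe_list attribute
qconf -mattr queue pe_list fillup all.q

# Submit a job that reserves 2 slots on one host
qsub -pe fillup 2 myjob.sh
```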

Consumables

Changing existing consumables

What if you want to change the number of available matlab licenses, urgent queues or CPLEX instances? You might be tempted to go into "complex configuration" and change the "default" value in the consumable's definition. However, this doesn't work. (I think all that value does is determine how many units of the consumable get used by requests that don't specify a number of units.) Instead, go to "Host configuration", then choose the "execution host" tab and select the host "global". Then under "consumables/fixed attributes" you'll see the consumables: matlab, cplex, urgent. Change the totals here!
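The same change can be made from the command line; a sketch, where 22 is just an example total:

```shell
# Show the global host's current consumable totals
qconf -se global

# Set the total number of matlab licenses to 22
# (22 is an example value)
qconf -mattr exechost complex_values matlab=22 global
```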

To find out how many matlab licenses are available, type (in UNIX):

   /cs/local/generic/lib/pkg/matlab-7.2/etc/lmstat -a
   
The 'matlab-7.2' part may change as new versions of matlab become available...

There are 22 CPLEX licenses bought as part of the CFI grant that purchased the cluster. Unless the department buys more someday, that's it...

Creating new consumables

If you want to create a new consumable, you do want to go to "complex configuration". Give it a name and shortcut, make it "int" and "<=", consumable, requestable and unforced, default=0, urgency=0. Then you'll need to set the number of available units through host configuration as above. However, the new consumable won't yet appear as a consumable in the "consumables/fixed attributes" pane to the right when you click on "global" in "execution host". How do you get it there? This is possibly the awesomest interface feat in SGE yet. Click on "modify" (with "global" selected). Click the "Consumable/Fixed Attribute" tab. There's the list of consumables--how do you get a new one to appear? Just click the "name" header (that's right!). You can figure it out from there.
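The same definition can be made without fighting qmon; a sketch, with myconsumable (shortcut myc) as a placeholder name and 5 as an example total:

```shell
# Edit the complex configuration in $EDITOR; add a line
# with columns: name shortcut type relop requestable
# consumable default urgency, e.g.:
#
#   myconsumable  myc  INT  <=  YES  YES  0  0
qconf -mc

# Then set the number of available units on the global host
qconf -mattr exechost complex_values myconsumable=5 global
```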

Creating for each machine

Frank created the memheavy consumable that limits memory-intensive jobs to one per machine. Consumables like this can be implemented by giving each individual machine its own consumable. I.e., in "Host configuration", choose "Execution host", and then choose a single machine, such as arrow01.cs.ubc.ca, instead of global. As above, click on modify and enter your consumable with value 1. Unfortunately, this has to be done for each machine in turn. When new machines come in, don't forget to give them such a consumable, too.
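The per-machine drudgery can be scripted; a sketch that loops over every execution host:

```shell
# Give every execution host one unit of memheavy
# (qconf -sel prints the list of execution hosts)
for host in $(qconf -sel); do
    qconf -mattr exechost complex_values memheavy=1 "$host"
done
```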

Maximum array job instances and tasks

These variables (max_aj_instances and max_aj_tasks) are in cluster configuration. Max_aj_instances used to be 20,000; KLB changed it to 100,000 on 10/13/06 because we seemed to have hit the maximum. (This made the cluster essentially unresponsive for about an hour afterwards; I'm not sure it was a good idea...) Max_aj_tasks is 1,000,000.
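These limits can be checked and changed from the command line as well; a sketch:

```shell
# Show the current cluster configuration values for
# max_aj_instances and max_aj_tasks
qconf -sconf | grep max_aj

# Edit the global cluster configuration in $EDITOR
qconf -mconf
```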

Test queue

There is a queue for testing purposes set up to run only on arrow01, which is not part of the regular queue. To use this queue, add -q test.q and -P eh2 to the qsub command. Your jobs should dispatch immediately, as the queue is usually empty. There may be other jobs running on arrow01, but it's OK to overload this (and only this) machine.
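A minimal submission to the test queue (myjob.sh is a placeholder script):

```shell
# Submit a quick test job to the test queue on arrow01
qsub -q test.q -P eh2 myjob.sh

# Check that it dispatched
qstat -u "$USER"
```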

Change log

Let's keep a log of administration changes made to the cluster, to help us undo bad changes if they occur.

  • kevinlb, 4/11/07: added the flag "batch" to the arrowtest.q queue, in order to allow batch jobs to be submitted. Now as far as I can tell the test queue works.
  • cnell, 3/31/10: added "fillup" parallel environment to all.q in order to allow jobs which reserve a whole machine.

-- FrankHutter - 11 Oct 2006

Topic revision: r14 - 2010-09-14 - FrankHutter
 