Sun Grid Engine Administration

For general information about using Sun Grid Engine, see SunGridEngine.

Checking the queue

  • qstat gives basic information about the jobs in the queue.
  • qstat -ext gives a bit of extra information, such as each job's project.
  • qstat -j <jobnumber> gives very detailed information about that job.
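For a quick overview of how many jobs are running versus waiting, the plain qstat listing can be summarized by state. This is a hypothetical helper (qstat_state_counts is not an SGE command), and it assumes the default qstat layout: two header lines, then one row per job with the state (r, qw, Eqw, ...) in the 5th column.

```shell
# Hypothetical helper (not part of SGE): summarize plain `qstat` output
# read from stdin by job state. Assumes the default layout: a header line,
# a dashed separator line, then one row per job whose 5th whitespace-
# separated column is the state (e.g. r, qw, Eqw).
qstat_state_counts() {
  awk 'NR > 2 && NF >= 5 { count[$5]++ }
       END { for (s in count) print s, count[s] }' | sort
}
```

Usage: qstat | qstat_state_counts prints one "state count" line per state, e.g. "qw 2" and "r 1".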

Is there a way to change the project of a running job? That could be useful someday...

Override Policy

This is the primary mechanism that determines which jobs are dispatched. In qmon, go to Policy Configuration and then Override Policy. Choose "project" from the dropdown, and you'll see all the SGE project names with the number of override tickets each one gets. These should always be multiples of 10,000: this ensures that override tickets trump share tree tickets (of which there are 9,000, as set on the main policy configuration page). Higher-priority projects take precedence over lower-priority ones: as long as their tickets are multiples of 10,000, no jobs from a lower-priority project will be started while jobs from a higher-priority project are pending. You can change the number of tickets a project gets here, but you can't create a new project here; to do that, go to "Project Configuration" from the main qmon dialog.
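The multiple-of-10,000 rule can also be checked from the command line. This is a sketch assuming the SGE 6 project format printed by qconf -sprj (lines such as "name eh" and "oticket 20000"); check_oticket is a hypothetical helper, not an SGE command.

```shell
# Hypothetical check (not part of SGE): read `qconf -sprj <project>` output
# on stdin and verify that the project's override tickets ("oticket" field)
# are a multiple of 10,000. Exit status: 0 = ok, 1 = not a multiple,
# 2 = no oticket field found.
check_oticket() {
  awk '$1 == "name"    { name = $2 }
       $1 == "oticket" { t = $2 }
       END {
         if (t == "")        { print name ": no oticket field"; exit 2 }
         if (t % 10000 == 0)   print name ": " t " ok"
         else                { print name ": " t " is not a multiple of 10000"; exit 1 }
       }'
}
```

Usage: to check every project at once, loop over the project list: for p in $(qconf -sprjl); do qconf -sprj "$p" | check_oticket; done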

Share Tree Policy

This is the policy that determines how competing jobs are scheduled when they belong to the same project. To change it, go to Policy Configuration and click on Share Tree Policy. Right now, each project (eh, ea, Urgent, etc.) has a node in the graph with a leaf called default. (If you don't see the leaf, double-click on the node to open it up.) SGE automatically adds all users in that project under this default leaf; they're listed inside. This gives every user in the project the same priority, so SGE tries to give each user the same amount of CPU time (not the same number of jobs) within the same time window. Of course, the share tree policy doesn't have to share resources evenly: you can add another leaf named after a specific user to give them extra shares. A node's portion is proportional to its entry in the Shares column, relative to its sibling nodes.
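To see what a Shares value means in practice, here is the arithmetic for one branch of the tree: each node's target fraction is its shares divided by the total shares of it and its siblings. The user names and share values below are made up for illustration, and sibling_fractions is a hypothetical helper.

```shell
# Illustrative arithmetic only (hypothetical helper, not part of SGE):
# given "node shares" pairs for sibling nodes in one branch of the share
# tree, print each node's fraction of that branch (shares / sum of shares).
sibling_fractions() {
  awk 'NF == 2 { n++; name[n] = $1; s[n] = $2; total += $2 }
       END {
         if (total <= 0) exit 1
         for (i = 1; i <= n; i++) printf "%s %.3f\n", name[i], s[i] / total
       }'
}
```

For example, printf 'alice 200\nbob 100\ncarol 100\n' | sibling_fractions prints alice 0.500, bob 0.250 and carol 0.250: giving one user twice the shares of each sibling targets half the branch's CPU time for that user.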

Adding a new user

Go to User Configuration, click on Userset, select the appropriate userset, click on Modify and enter the username. The following usersets are mutually exclusive: a user only needs to be added to one of them, and should be added to the one highest on the list that applies to them:

  • EH-Models
  • EmpiricalAlgorithmics
  • KLB
  • KPM
  • General
The other usersets are not exclusive: a user who belongs to one of the sets above can also belong to any of them. It is important to add every user to the userset called arrow; if you don't, the user will be able to submit jobs, but those jobs may never run. Users need to belong to urgentusers to submit urgent jobs.
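The same steps can be done with qconf -au <user> <userset> instead of qmon. The sketch below is a hypothetical dry-run helper: it only prints the commands for one department userset, arrow, and optionally urgentusers, so they can be reviewed before running them. The department names are taken from the list above.

```shell
# Hypothetical dry-run helper (not part of SGE): print the qconf commands
# that would add a user to exactly one department userset, to "arrow",
# and optionally to "urgentusers". Nothing is executed.
# Arguments: user, department userset, optional "yes" for urgentusers.
add_user_cmds() {
  local user=$1 dept=$2 urgent=${3:-no}
  case $dept in
    EH-Models|EmpiricalAlgorithmics|KLB|KPM|General) ;;
    *) echo "unknown department userset: $dept" >&2; return 1 ;;
  esac
  echo "qconf -au $user $dept"
  echo "qconf -au $user arrow"
  [ "$urgent" = yes ] && echo "qconf -au $user urgentusers"
  return 0
}
```

Usage: add_user_cmds jdoe KPM yes prints three qconf commands; once they look right, they can be piped to sh.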

Adding a new user set

Go to User Configuration, click on Userset, make sure Department is selected in the lower left, and click on Add. Then add people as described above. (To transfer people from other user sets, you have to delete them from those sets first and then add them to the new one.) You can then associate the new user set with the projects its users should be eligible to submit to.
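A new department userset can also be loaded from a file with qconf -Au <file>. This sketch assumes the SGE 6 access-list file format (name, type, fshare, oticket, entries); make_dept_userset is a hypothetical helper, and the userset and user names in the usage note are made up.

```shell
# Hypothetical helper (not part of SGE): print a userset definition of
# type DEPT in the access-list file format, for loading with `qconf -Au`.
# Arguments: userset name, then zero or more user names.
make_dept_userset() {
  local name=$1 entries
  shift
  entries=$(printf '%s,' "$@")   # "alice,bob," (or just "," with no users)
  entries=${entries%,}           # drop the trailing comma
  [ -n "$entries" ] || entries=NONE
  printf 'name    %s\ntype    DEPT\nfshare  0\noticket 0\nentries %s\n' "$name" "$entries"
}
```

Usage: make_dept_userset NewLab alice bob > newlab.txt && qconf -Au newlab.txt. To move someone from an existing department, remove them first with qconf -du alice OldSet.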

Adding a new project

Go to Project Configuration and click on Add. Enter the name of the project, then use the buttons below to choose the user sets or users who are eligible to submit jobs to this project. For example, to add a user set, click on the left button and choose the applicable user sets in the window that pops up.
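Projects can also be created from a file with qconf -Aprj <file>. This is a sketch assuming the SGE 6 project file format (name, oticket, fshare, acl, xacl); make_project is a hypothetical helper, and it refuses override ticket counts that break the multiple-of-10,000 rule from the Override Policy section.

```shell
# Hypothetical helper (not part of SGE): print a project definition for
# `qconf -Aprj <file>`. Arguments: project name, override tickets,
# optional comma-separated usersets allowed to submit (the "acl" field).
make_project() {
  local name=$1 oticket=$2 acl=${3:-NONE}
  if [ $((oticket % 10000)) -ne 0 ]; then
    echo "oticket $oticket is not a multiple of 10000" >&2
    return 1
  fi
  printf 'name    %s\noticket %s\nfshare  0\nacl     %s\nxacl    NONE\n' \
    "$name" "$oticket" "$acl"
}
```

Usage: make_project newproj 30000 NewLab > newproj.txt && qconf -Aprj newproj.txt (the project and userset names here are made up).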

Consumables

Changing existing consumables

What if you want to change the number of available matlab licenses, urgent queues or CPLEX instances? You would be tempted to go into "complex configuration" and change the value "default" on the consumable's definition. However, this doesn't work. (I think all this does is determine how many units of the consumable get used by requests to use the consumable that don't specify a number of units.) Instead, go to "Host configuration", then choose the "execution host" tab and select the host "global". Then under "consumables/fixed attributes" you'll see the consumables: matlab, cplex, urgent. Change the totals here!
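The current totals live in the complex_values line of the "global" execution host, which qconf -se global prints. The helper below is a hypothetical sketch that assumes the whole list fits on one line (long lists may be wrapped with a trailing backslash, which it doesn't handle).

```shell
# Hypothetical helper (not part of SGE): read `qconf -se global` output on
# stdin and print each consumable total from the complex_values line,
# one "name=value" per line. Assumes the list is on a single line.
global_consumables() {
  awk '$1 == "complex_values" {
         n = split($2, kv, ",")
         for (i = 1; i <= n; i++) print kv[i]
       }'
}
```

Usage: qconf -se global | global_consumables. A command-line equivalent of editing the total in qmon should be qconf -mattr exechost complex_values matlab=12 global (12 is just an example value); double-check the result with the helper afterwards.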

To find out how many matlab licenses are available, type the following at a UNIX shell:

   /cs/local/generic/lib/pkg/matlab-7.2/etc/lmstat -a
   
The 'matlab-7.2' part may change as new versions of matlab become available...
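The lmstat -a output is long; the per-feature summary lines are the ones that matter. This sketch assumes the usual FLEXlm summary format ("Users of MATLAB:  (Total of 20 licenses issued;  Total of 3 licenses in use)"); license_summary is a hypothetical helper.

```shell
# Hypothetical helper (not part of FLEXlm or SGE): read `lmstat -a` output
# on stdin and print issued/used/free counts per feature. Assumes summary
# lines of the form:
#   Users of MATLAB:  (Total of 20 licenses issued;  Total of 3 licenses in use)
# Lines reporting errors instead of totals are skipped.
license_summary() {
  awk '/^Users of .*Total of/ {
         feature = $3; sub(/:$/, "", feature)
         issued = $6; used = $11
         printf "%s issued=%d used=%d free=%d\n", feature, issued, used, issued - used
       }'
}
```

Usage: /cs/local/generic/lib/pkg/matlab-7.2/etc/lmstat -a | license_summary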

There are 22 CPLEX licenses bought as part of the CFI grant that purchased the cluster. Unless the department buys more someday, that's it...

Creating new consumables

If you want to create a new consumable, you do want to go to "complex configuration". Give it a name and nickname, make it "int" and "<=", consumable, requestable and unforced, default=0, urgent=0. Then you'll need to set the number of available units through host configuration as above. However, the new consumable won't yet appear as a consumable in the "consumables/fixed attributes" pane to the right when you click on "global" in "execution host". How do you get it there? This is possibly the awesomest interface feat in SGE yet. Click on "modify" (with "global" selected). Click the "Consumable/Fixed Attribute" tab. There's the list of consumables--how do you get a new one to appear? Just click the "name" header (that's right!). You can figure it out from there.
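If qmon fights you, the same complex definition can be written as one line in the column layout qconf -sc uses (name, shortcut, type, relop, requestable, consumable, default, urgency). This is a sketch: consumable_line is a hypothetical helper, "newlic" is a made-up consumable name, and the qconf workflow in the usage note is untested here.

```shell
# Hypothetical helper (not part of SGE): print a complex definition line
# with the settings described above: INT, <=, requestable but not forced
# (YES), consumable (YES), default 0, urgency 0.
# Columns: name shortcut type relop requestable consumable default urgency
# Arguments: name, optional shortcut (defaults to the name).
consumable_line() {
  printf '%-12s %-8s INT <= YES YES 0 0\n' "$1" "${2:-$1}"
}
```

Usage: qconf -sc > complexes.txt, then consumable_line newlic nl >> complexes.txt, then qconf -Mc complexes.txt. The total should then be addable on the global host with qconf -aattr exechost complex_values newlic=4 global.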

Maximum array job instances and tasks

These variables (max_aj_instances and max_aj_tasks) are in cluster configuration. Max_aj_instances used to be 20,000; KLB changed it to 100,000 on 10/13/06 because we seemed to have hit the maximum. (This made the cluster essentially unresponsive for about an hour afterwards; I'm not sure it was a good idea...) Max_aj_tasks is 1,000,000.
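Both values appear in the global cluster configuration printed by qconf -sconf (and are edited with qconf -mconf). A hypothetical one-liner to pull them out:

```shell
# Hypothetical helper (not part of SGE): read `qconf -sconf` output on
# stdin and print the two array-job limits as name=value lines.
aj_limits() {
  awk '$1 == "max_aj_instances" || $1 == "max_aj_tasks" { print $1 "=" $2 }'
}
```

Usage: qconf -sconf | aj_limits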

Test queue

There is a queue for testing purposes, arrowtest.q, set up to run only on arrow01, which is not part of the regular queue. To use it, add the options -q arrowtest.q -P eh2 to your qsub command. Your jobs should be dispatched immediately, since the queue is usually empty. There may be other jobs running on arrow01; however, it's OK to overload this (and only this) machine.
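A small wrapper saves retyping the options. submit_test is a hypothetical name; with DRY_RUN=1 it only prints the qsub command.

```shell
# Hypothetical wrapper (not part of SGE): submit to the arrow01 test queue
# under project eh2, passing any other qsub arguments through unchanged.
# With DRY_RUN=1 it prints the command instead of running it.
submit_test() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo qsub -q arrowtest.q -P eh2 "$@"
  else
    qsub -q arrowtest.q -P eh2 "$@"
  fi
}
```

Usage: submit_test myjob.sh. Alternatively, put the lines "#$ -q arrowtest.q" and "#$ -P eh2" near the top of the job script itself.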

Change log

Let's keep a log of administration changes made to the cluster, to help us undo bad changes if they occur.

  • kevinlb, 4/11/07: added the flag "batch" to the arrowtest.q queue, in order to allow batch jobs to be submitted. Now as far as I can tell the test queue works.

-- FrankHutter - 11 Oct 2006

Topic revision: r7 - 2007-04-11 - KevinLeytonBrown