WestGrid - quick user guide

This page is part of the EmpiricalAlgorithmics web.

Introduction

WestGrid operates high performance computing (HPC), collaboration and visualization infrastructure across western Canada. It encompasses 14 partner institutions across four provinces.

An extensive overview of WestGrid can be found at the WestGrid website. You can also read the QuickStart Guide for New Users at http://www.westgrid.ca/support/quickstart/new_users

How to get a WestGrid account?
What will you get in the next few days?

After you submit your application, you will get a few e-mails from WestGrid.
How to transfer my files to/between WestGrid?

Assume my host machine is okanagan.cs.ubc.ca, my WestGrid storage server is silo.westgrid.ca and my cluster in WestGrid is glacier.westgrid.ca. The file I want to transfer is test.txt.
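As a rough sketch of what such transfers might look like with scp, the snippet below builds the commands for the three machines named above and prints them instead of running them. The username `myuser` and the home-directory destinations are assumptions made for illustration, not values taken from this page.

```shell
#!/bin/sh
# Hypothetical WestGrid username; replace with your own.
WG_USER="myuser"
FILE="test.txt"

# Each command is stored as a string and printed rather than executed,
# so it can be checked before being run by hand.

# From okanagan.cs.ubc.ca to the storage server (home directory is an assumed destination).
to_silo="scp $FILE $WG_USER@silo.westgrid.ca:~/"
# From okanagan.cs.ubc.ca directly to the cluster.
to_glacier="scp $FILE $WG_USER@glacier.westgrid.ca:~/"
# From the storage server to the cluster, run while logged in to silo.westgrid.ca.
silo_to_glacier="scp ~/$FILE $WG_USER@glacier.westgrid.ca:~/"

printf '%s\n' "$to_silo" "$to_glacier" "$silo_to_glacier"
```

Copy the printed line you need into a terminal on the appropriate machine once the username and paths have been adjusted.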
Running Jobs

A great majority of the computational work on WestGrid systems is carried out through non-interactive batch processing. Job scripts containing commands to be executed are submitted from a login server to a batch job handling system, which queues the requests, allocates processors, and starts and manages the jobs. The system software that handles your batch jobs consists of two pieces: a resource manager (TORQUE) and a scheduler (Moab). This system is fairly similar to our SunGridEngine. For detailed information, please visit http://westgrid.ca/support/running_jobs.

A batch job script is a text file of commands for the UNIX shell to interpret, similar to what you could execute by typing directly at a keyboard. The job is submitted to a queue using the qsub command. A job will wait in the queue depending on factors such as system load and the priority assigned to the job. When appropriate resources become available to run a job, it is started on one or more assigned processors. A job will be terminated if it exceeds its allotted time limit or, on some systems, if it exceeds memory limits. By default, the standard output and error streams from the job are directed to files in the directory from which the job was submitted.

For detailed information on how to write a job script, please visit http://westgrid.ca/support/running_jobs#directives
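To make the description above concrete, here is a minimal sketch of a TORQUE job script. The job name and every resource value are illustrative assumptions; appropriate limits depend on the cluster and should be checked against the WestGrid documentation linked above.

```shell
#!/bin/bash
# Minimal TORQUE job script sketch. The #PBS lines are directives read by qsub;
# to the shell they are ordinary comments. All values below are assumptions.
#PBS -N example_job
#PBS -l walltime=01:00:00
#PBS -l procs=1
#PBS -l pmem=2000mb
#PBS -j oe

# TORQUE sets PBS_O_WORKDIR to the directory qsub was run from.
# Fall back to the current directory so the script also runs outside the batch system.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1

echo "Job started on $(uname -n) at $(date)"
# Replace the next line with the real work, e.g. ./my_program input.txt (hypothetical).
echo "Running in $workdir"
```

Saved under a name of your choosing (say example_job.pbs, a hypothetical file name), the script would be submitted with `qsub example_job.pbs`. The `-j oe` directive merges the error stream into the output file, which keeps the results of a small job in one place.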
A few useful commands:
Scheduler (Fairshare & RAC)

The WestGrid job scheduler is a priority queue with a backfill mechanism. The scheduler will dispatch the highest priority job in the "eligible jobs" queue if there are sufficient resources for it to run. If there are insufficient resources to run the highest priority job, the scheduler will find the next highest priority job whose execution will not overlap with the approximate* earliest start of the original job. A job's priority is a weighted sum of processor-equivalent hours discounted over a 10-day time period.

* Since jobs can finish before their time cutoff, the scheduler uses an upper bound on the earliest start time of a job.

Requested Resources

The resource that affects dispatching is processor-equivalent hours.
Processor-equivalent hours refers to the number of processors your job will "take away" from the pool of resources. With small memory jobs, processor-equivalent hours are the same as processor hours. With high memory jobs, however, the memory left on a node may become insufficient for the other processors to be utilized. The QDR nodes have 24 GB for 12 processors, so there is 2 GB per processor. If you use X GB, you will be counted as using max(# processors requested, X/2).
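The max(# processors requested, X/2) rule can be tried out with a small shell function. This is only a sketch of the arithmetic stated above: the function name `proc_equiv` is made up for illustration, and whether the scheduler rounds fractional values is not stated here, so the result is left unrounded.

```shell
#!/bin/sh
# Processor equivalents on a node type with a fixed amount of memory per processor.
# Usage: proc_equiv <procs_requested> <memory_gb> [gb_per_proc]
# gb_per_proc defaults to 2, the QDR figure quoted above (24 GB / 12 processors).
proc_equiv() {
  awk -v p="$1" -v m="$2" -v g="${3:-2}" 'BEGIN {
    procs = p + 0          # force numeric interpretation
    mem_procs = m / g      # processors "used up" by the memory request
    result = (procs > mem_procs) ? procs : mem_procs
    print result
  }'
}

proc_equiv 1 10   # 1 processor with 10 GB is charged as 5 processors
proc_equiv 4 2    # 4 processors with 2 GB is charged as 4 processors
```

The first call shows why a single-processor job with a large memory request can weigh on priority as much as a five-processor job.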
Fairshare (& RAC)

A user's (or account's) fairshare value is a weighted average of cluster usage in a set of disjoint time windows. For example, Orcinus and Glacier use 7 time windows that each last 36 hours with the following weights:
(FS Weight) * (FS User Weight) * ((FS User Target) - (FS User Value)) * (FS Account Weight) * ((FS Account Target) - (FS Account Value))
(FS Weight) * (FS User Weight) * ((FS User Target) - (FS User Value)) * (FS Account Weight) * MAX(0, (FS Account Target) - (FS Account Value))

(Difference: accounts with an RAC are not penalized for going over target.)

On Orcinus and Glacier:
Priority of jobs within a single user's queue

Using qstat -u (username), you can look at your queue. Below is an example queueing state:

6111297   v7q8   Running   1   3:04:25   Thu Jan 14 17:46:04
6111294   v7q8   Running   1   3:04:25   Thu Jan 14 17:46:04
6111295   v7q8   Running   1   3:04:25   Thu Jan 14 17:46:04
6111293   v7q8   Running   1   3:04:25   Thu Jan 14 17:46:04
6111296   v7q8   Running   1   3:04:25   Thu Jan 14 17:46:04

5 active jobs   5 of 9616 processors in use by local jobs (0.5%)
                919 of 931 nodes active (98.71%)

eligible jobs----------------------
JOBID     USERNAME   STATE   PROCS   WCLIMIT   QUEUETIME
6111302   v7q8       Idle    1       6:00:00   Tue Jan 12 13:10:49
6111299   v7q8       Idle    1       6:00:00   Tue Jan 12 13:10:49
6111300   v7q8       Idle    1       6:00:00   Tue Jan 12 13:10:49
6111298   v7q8       Idle    1       6:00:00   Tue Jan 12 13:10:49
6111301   v7q8       Idle    1       6:00:00   Tue Jan 12 13:10:49

5 eligible jobs

blocked jobs-----------------------
JOBID     USERNAME   STATE   PROCS   WCLIMIT   QUEUETIME
6111303   v7q8       Idle    1       6:00:00   Tue Jan 12 13:10:49
6111304   v7q8       Idle    1       6:00:00   Tue Jan 12 13:10:49
6111305   v7q8       Idle    1       6:00:00   Tue Jan 12 13:10:49
6111306   v7q8       Idle    1       6:00:00   Tue Jan 12 13:10:49
6111307   v7q8       Idle    1       6:00:00   Tue Jan 12 13:10:49

5 blocked jobs

The jobs with "eligible" status are the only jobs that the dispatch system considers for allocation. The dispatch system does not know about your "blocked" jobs until they are upgraded to "eligible" status. The dispatcher will take the highest priority job among all eligible jobs and dispatch it if there are sufficient resources. As described earlier, lower priority jobs may only be dispatched if their walltimes do not exceed the minimum start time of higher priority jobs (that is, the minimum remaining runtime of the currently running jobs that would free enough space for the higher priority job). This means that the user must wait for high resource "eligible" jobs to be dispatched before lower resource "blocked" jobs can be.
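The walltime condition for backfilling can be illustrated with a short sketch. It only compares a candidate job's requested walltime with the time until the higher priority job is expected to start; the real scheduler also checks that enough processors and memory are free. The function names `to_seconds` and `can_backfill` are invented for this example.

```shell
#!/bin/sh
# Convert an H:MM:SS time (as shown in the queue listing) to seconds.
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeeds if a job requesting walltime $1 would finish no later than
# the earliest start ($2, measured from now) of the higher priority job.
can_backfill() {
  [ "$(to_seconds "$1")" -le "$(to_seconds "$2")" ]
}

# Using the listing above: the running jobs have 3:04:25 remaining.
if can_backfill 2:00:00 3:04:25; then
  echo "a 2:00:00 job fits in the gap"
fi
if ! can_backfill 6:00:00 3:04:25; then
  echo "a 6:00:00 job must wait"
fi
```

One practical consequence is that requesting a walltime close to what a job actually needs makes it more likely to be slotted into such gaps.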
Tracking Dispatching and Priority

To see the usage of your group or individual account, first navigate to:

cd /global/system/info/
Within the ./fair_share subdirectory, there are files with "fair share" information for every day in the current month and further subdirectories containing files going back 5 years.
Each file contains the % usage info for all accounts and users according to the time window weighting scheme described above. If you grep the file for your account and user, you will get something like this:
FSInterval % Target 0 1 2 3 4 5 6
------------- ACCT -------------
gdx-911-ae 11.46 7.50+ 12.62 15.47 12.12 7.61 10.05 11.89 8.78
gdx-911-aa* 2.11 0.50- 1.40 3.28 3.90 3.33 1.27 0.00
------- USER -------------
v7q8 3.90 ------- 4.58 5.81 4.49 1.78 3.75 4.16 1.71

Account gdx-911-ae has 11.46% processor-equivalent usage of orcinus, weighted over the seven 36-hour time windows. The target usage of the dispatch system for account gdx-911-ae is 7.50+ (at least 7.5%) of the processor equivalents of orcinus. As of January 2016, orcinus has 10,000 processors. Thus a 7.5% allocation means the account should have ~750 processors allocated to it at any one time.

The ./stats subdirectory contains files with detailed usage information for all the accounts. If you grep for your accounts, the file will display usage like this:
             |--------- Active ------|---------------------------- Completed -------------------------------|
acct          Jobs  Procs  ProcHours  Jobs     %   PHReq      %  PHDed     %  FSTgt  AvgXF  AvgQH
gdx-911-ae     107    481    13981.3   373   8.9   2.35K   0.16  8.95K  2.05  2.00+   14.5   70.2
gdx-911-aa      73    292    27376.5    -0  -0.0  ------  -0.00  7.65K  1.75  2.00    -0.0   -0.0

The number of jobs, processors (Procs), and processor-equivalent hours (ProcHours) are shown for active jobs. The "Completed" section shows job stats for completed jobs aggregated over the whole year. PHReq corresponds to processor hours requested. PHDed presumably corresponds to processor hours dedicated to our group.

Disk Quota

Your disk quotas are based on the number of files, and not just the amount of disk space you use. To check your quota on orcinus, type the following in a directory within your filesystem:

lfs quota -u v7q8 ./
A handy short script to see the number of files in your directories recursively is shown below. The script obtains the number of files in all subdirectories and displays the 50 largest directories in terms of number of files.
# Run from the top of the directory tree you want to inspect.
# Each count includes the directory itself.
find . -xdev -type d -print0 |
  while IFS= read -r -d '' dir; do
    echo "$(find "$dir" -maxdepth 1 -print0 | grep -zc .) $dir"
  done |
  sort -rn | head -50

When managing disk usage, it may be useful to identify the locations that take up the most disk space. The command below sorts the entries in the current directory by disk usage and prints the 10 largest. This command could take substantial time to execute if there are a large number of files in the subdirectories.

du -sch .[!.]* * | sort -h -r | head -n 10