
Difference between revisions of "Ada:Batch Job Submission"

From TAMU HPRC
(Job Submission: the bsub command)
Line 1: Line 1:
==Job Submission: the bsub command==
+
== Job Submission ==
<pre>
+
Once you have your job file ready, it is time to submit your job. You can submit your job to LSF with the following command:
bsub < jobfile                  # Submits specified job for processing by LSF
+
[ NetID@ada ~]$ '''bsub < ''MyJob.LSF'''''
</pre>
+
Verifying job submission parameters...
 +
Verifying project account...
 +
      Account to charge:  123456789123
 +
          Balance (SUs):      4824.7811
 +
          SUs to charge:        5.0000
 +
Job <12345> is submitted to default queue <sn_regular>.
  
Here is an illustration,
+
After a job has been submitted, you may want to check on its progress or cancel it. Below is a list of the most used job monitoring and control commands for jobs on Ada.
  
<pre>
+
{| class="wikitable" style="text-align: center;"
$ bsub < sample1.job
+
|+ Ada (LSF) Job Monitoring and Control Commands
Verifying job submission parameters...
+
! style="width: 200pt;" | Function
Job <224139> is submitted to default queue <devel>.
+
! style="width: 200pt;" | Command
</pre>
+
! style="width: 150pt;" | Example
 +
|-
 +
|Submit a job
 +
|bsub < [script_file]
 +
|bsub < MyJob.LSF
 +
|-
 +
|Cancel/Kill a job
 +
|bkill [Job_ID]
 +
|bkill 101204
 +
|-
 +
|Check summary status of a single job
 +
|bjobs [job_id]
 +
|bjobs 101204
 +
|-
 +
|Check summary status of all <br> jobs for a user
 +
|bjobs -u [user_name]
 +
|bjobs -u adaUser1
 +
|-
 +
|Check detailed status of a single job
 +
|bjobs -l [job_id]
 +
|bjobs -l 101204
 +
|-
 +
|Modify job submission options
 +
|bmod [bsub_options] [job_id]
 +
|bmod -W 2:00 101204
 +
|}
  
The first thing LSF does upon submission is to tag your job with a numeric identifier, a job id.
+
For more information on any of the commands above, please see their respective ''man'' pages.
Above, that identifier is '''224139'''. You will need it in order to track or manage (kill or modify)
+
   [ NetID@ada ~]$ '''man bmod'''
your jobs. Next, note that the default current working directory for the job is the directory
 
you submitted the job from. If that's not what you need, you must explicitly indicate that, as we
 
do above when we cd into a specific directory. On job completion, LSF will place in the submission
 
directory the file stdout1.224139. It contains a log of job events and other data directed to
 
standard out. Always inspect this file for useful information.<br>
 
 
 
By default, a job executes under the environment of the submitting process. This you can change
 
by using the '''-L shell''' option (see below) and/or by specifying at the start of the job script
 
the shell that will execute it. For example, if you want the job to execute under the C-shell, the first
 
command above the #BSUB directives should be #!/bin/csh.
 
 
 
 
 
===Job tracking and control commands===
 
 
 
<pre>
 
bjobs [-u all or user_name] [[-l] job_id]    # displays job information per user(s) or job_id, in summary or detail (-l) form, respectively.
 
bpeek [-f] job_id                            # displays the current contents of stdout and stderr output of an executing job.
 
bkill job_id                                # kills, suspends, or resumes unfinished jobs. See man bkill for details.
 
bmod [bsub_options]  job_id                # Modifies job submission options of a job. See man bmod for details.
 
lsload [node_name]                          # Lists on std out a node's utilization. Use bjobs -l jobid
 
                                            # to get the names of nodes associated with a jobid. See man lsload for details.
 
</pre>
 
 
 
All of the above have decent man pages, if you're interested in more detail.
 
 
 
'''Examples'''
 
<pre>
 
$ bjobs -u all
 
JOBID      STAT  USER            QUEUE      JOB_NAME            NEXEC_HOST SLOTS RUN_TIME        TIME_LEFT
 
223537    RUN   adinar          long      NOR_Q                1          20    400404 second(s) 8:46 L
 
223547    RUN  adinar          long      NOR_Q                1          20    399830 second(s) 8:56 L
 
223182    RUN  tengxj1025      long      pro_at16_lowc        10        280  325922 second(s) 5:27 L
 
229307    RUN  natalieg        long      LES_MORE            3          900  225972 second(s) 25:13 L
 
229309    RUN  tengxj1025      long      pro_atat_lowc        7          280  223276 second(s) 33:58 L
 
229310    RUN  tengxj1025      long      cg16_lowc            5          280  223228 second(s) 33:59 L
 
. . .            . . .    . . .
 
 
 
$ bjobs -l 229309
 
 
 
Job <229309>, Job Name <pro_atat_lowc>, User <tengxj1025>, Project <default>, M
 
                          ail <czjnbb@gmail.com>, Status <RUN>, Queue <long>, J
 
                          ob Priority <250000>, Command <## job name;#BSUB -J p
 
                          ro_atat_lowc; ## send stderr and stdout to the same f
 
                          ile ;#BSUB -o info.%J; ## login shell to avoid copyin
 
                          g env from login session;## also helps the module fun
 
                          ction work in batch jobs;#BSUB -L /bin/bash; ## 30 mi
 
                          nutes of walltime ([HH:]MM);#BSUB -W 96:00; ## numpro
 
                          cs;#BSUB -n 280; . . .
 
                          . . .
 
 
 
RUNLIMIT
 
5760.0 min of nxt1449
 
Tue Nov  4 21:34:43 2014: Started on 280 Hosts/Processors <nxt1449> <nxt1449> <
 
                          nxt1449> <nxt1449> <nxt1449> <nxt1449>  ...
 
                          . . .
 
 
 
Execution
 
                          CWD </scratch/user/tengxj1025/EXTD/pro_atat/lowc/md>;
 
Fri Nov  7 12:05:55 2014: Resource usage collected.
 
                          The CPU time used is 67536997 seconds.
 
                          MEM: 44.4 Gbytes;  SWAP: 0 Mbytes;  NTHREAD: 862
 
 
 
                          HOST: nxt1449
 
                          MEM: 3.2 Gbytes;  SWAP: 0 Mbytes; CPU_TIME: 9004415 s
 
                          econds . . .
 
                          . . .
 
                          . . .
 
 
 
 
 
$ bmod -W 46:00 229309            # resets wall-clock time to 46 hrs for job 229309
 
 
 
 
 
</pre>
 
 
 
 
 
'''Node Utilization.''' A job may use its allocated nodes inefficiently. Sometimes this is unavoidable, but often it can be avoided. It is unavoidable, for instance, when a job needs a large fraction of a node's memory but runs on only 1 cpu. In that case, cpu utilization will be at best 5% (1/20) on a regular node. A handy tool for tracking node utilization, more practical than lsload, is the homegrown '''lnu''' command.
 
 
 
<pre>
 
lnu [-h] [-l] -j jobid          # lists on stdout the utilization across all nodes for an executing job. See examples below.
 
</pre>
 
 
 
'''Examples'''
 
 
 
<pre>
 
$ lnu -l -j 795375
 
Job          User                Queue        Status Node  Cpus
 
795375      jomber23            medium            R    4    80 
 
        HOST_NAME      status  r15s  r1m  r15m  ut    pg  ls    it  tmp  swp  mem    Assigned Cores
 
        nxt1417            ok  20.0  21.0  21.0  97%  0.0  0 94976  366M  3.7G 41.6G    20
 
        nxt1764 (L)        ok  19.7  20.0  20.0  95%  0.0  0 95040  366M  3.7G 41.5G    20
 
        nxt2111            ok  20.0  20.0  20.0  98%  0.0  0 91712  370M  4.2G 41.5G    20
 
        nxt2112            ok  20.0  21.1  21.0  97%  0.0  0 91712  370M  4.2G 41.6G    20
 
=========================================================================================================
 
 
 
$ lnu -l -j 753454
 
Job          User                Queue        Status Node  Cpus
 
753454      ajochoa              long              R    1    20 
 
        HOST_NAME      status  r15s  r1m  r15m  ut    pg  ls    it  tmp  swp  mem    Assigned Cores
 
        nxt1222 (L)        ok  4.3  4.5  6.2  20%  0.0  0 54464  422M  4.7G 52.9G    20
 
=========================================================================================================
 
 
 
</pre>
 
 
 
Of the columns shown, the utilization ('''ut''') and memory paging ('''pg''') are probably the most significant. Note that '''tmp''', '''swp''', and '''mem''' report the ''available'' amounts of each resource.
 

Revision as of 11:19, 9 January 2017

Job Submission

Once you have your job file ready, it is time to submit your job. You can submit your job to LSF with the following command:

[ NetID@ada ~]$ bsub < MyJob.LSF
Verifying job submission parameters...
Verifying project account...
     Account to charge:   123456789123
         Balance (SUs):      4824.7811
         SUs to charge:         5.0000
Job <12345> is submitted to default queue <sn_regular>.
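
The job ID in the last line of this output is what the monitoring commands take as an argument, so it can be handy to capture it in a script. The following is a minimal sketch, assuming the confirmation line keeps the format shown above; the variable names <code>submit_output</code> and <code>job_id</code> are illustrative only, and the sample line stands in for real <code>bsub</code> output.

```shell
# Sample confirmation line copied from the output above. In a real script
# this would come from something like: submit_output=$(bsub < MyJob.LSF)
submit_output='Job <12345> is submitted to default queue <sn_regular>.'

# Keep only the digits inside the first pair of angle brackets on the
# "Job <...> is submitted" line; other lines are ignored by sed -n.
job_id=$(printf '%s\n' "$submit_output" | sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p')

echo "Submitted job ID: $job_id"
```

If the site changes the wording of the confirmation line, the pattern would need to be adjusted accordingly.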

After a job has been submitted, you may want to check on its progress or cancel it. Below is a list of the most used job monitoring and control commands for jobs on Ada.

{| class="wikitable" style="text-align: center;"
|+ Ada (LSF) Job Monitoring and Control Commands
! style="width: 200pt;" | Function
! style="width: 200pt;" | Command
! style="width: 150pt;" | Example
|-
|Submit a job
|bsub < [script_file]
|bsub < MyJob.LSF
|-
|Cancel/Kill a job
|bkill [Job_ID]
|bkill 101204
|-
|Check summary status of a single job
|bjobs [job_id]
|bjobs 101204
|-
|Check summary status of all <br> jobs for a user
|bjobs -u [user_name]
|bjobs -u adaUser1
|-
|Check detailed status of a single job
|bjobs -l [job_id]
|bjobs -l 101204
|-
|Modify job submission options
|bmod [bsub_options] [job_id]
|bmod -W 2:00 101204
|}

For more information on any of the commands above, please see their respective man pages.

 [ NetID@ada ~]$ man bmod
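
When checking on a job from a script, it can help to pull just the status field out of the <code>bjobs</code> summary. The sketch below is one possible approach, not an official HPRC tool: the helper name <code>job_stat</code> is hypothetical, it finds the STAT column by its header name, and it assumes that no field to the left of STAT contains spaces. The sample text reuses a header and row from the <code>bjobs -u all</code> listing in the earlier revision of this page.

```shell
# Hypothetical helper: read bjobs-style summary text on stdin and print
# the STAT value for the job ID given as the first argument.
job_stat() {
    awk -v id="$1" '
        # Header line: remember which column is labelled STAT.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "STAT") col = i; next }
        # Data line for the requested job: print that column.
        $1 == id && col { print $col }
    '
}

# Sample summary text (header plus one job line) standing in for real output.
sample='JOBID      STAT  USER            QUEUE      JOB_NAME            NEXEC_HOST SLOTS RUN_TIME        TIME_LEFT
223537    RUN   adinar          long      NOR_Q                1          20    400404 second(s) 8:46 L'

status=$(printf '%s\n' "$sample" | job_stat 223537)
echo "Job 223537 status: $status"
```

On the cluster, the same helper could be fed live output, for example <code>bjobs 101204 | job_stat 101204</code>, provided the header still contains a STAT column.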