Difference between revisions of "HPRC:CommonProblems"

From TAMU HPRC

Revision as of 11:24, 21 July 2016

Common Problems & Quick Solutions

Accounts

Q: When do accounts expire?

A: Accounts expire at the start of the new fiscal year (September 1st). You can see when your account expires by going to our Account Management System (AMS) and checking under the Accounts tab.

Q: How do I get more SUs?

A: Students need to have their PI transfer SUs to them. PIs can apply for up to two Small accounts totaling no more than 200,000 SUs. Once this Small allocation has run out, PIs need to apply for a Large allocation. See our Account Allocations page for more information on the allocation policies.

Q: How do I transfer SUs?

A: To transfer SUs, PIs will need a Small or Large account (see our Account Allocations page for more information). Once an account has been granted to the PI, they can transfer SUs to any of their researchers on our Account Management System (AMS). If a PI needs to add a new researcher, the PI must contact the Help Desk.

Batch Processing

Q: Why is my job pending?

A: There can be many reasons why a job would be pending:

  • Your job cannot fit on any of our nodes
    • If your job requests more than 245GB of memory without requesting the xlarge queue, your job will be stuck pending. To fix this, kill your job and resubmit with less memory or in the xlarge queue. IMPORTANT NOTE: Your program MUST use Westmere-compatible software to be able to run in the xlarge queue.
    • If your job asks for more than the maximum number of cores per node (Ada: 20 or 40 with the xlarge queue, Curie: 16) with #BSUB -R "span[ptile=XX]" your job will be stuck pending. To fix this, kill your job and resubmit with a ptile value less than or equal to the maximum value for the cluster.
    • If your job requests more than 2TB of memory, your job will be stuck pending. To fix this, kill your job and resubmit with less memory.
  • There are no job slots available
    • If your job requires the usage of the 256GB, 1TB, or 2TB nodes, your job might be pending for longer than usual.
    • If the cluster usage is particularly high right now, your job might be pending for longer than usual. You can see the System Load Levels on our Home Page.
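
The core-count limits above can be checked before submitting. The following is a minimal sketch in POSIX shell, not an HPRC tool: the function names max_ptile and check_ptile are hypothetical, the limits are hard-coded from the list above, and any queue name other than xlarge is assumed to be a regular queue.

```shell
# Hypothetical helper (not an HPRC tool): print the largest ptile value
# allowed for a cluster/queue pair, using the limits quoted above.
max_ptile() {
    cluster="$1"
    queue="$2"
    case "$cluster" in
        ada)
            # Ada: 20 cores per node, or 40 in the xlarge queue.
            if [ "$queue" = "xlarge" ]; then echo 40; else echo 20; fi
            ;;
        curie)
            echo 16
            ;;
        *)
            echo "unknown cluster: $cluster" >&2
            return 1
            ;;
    esac
}

# Usage: check_ptile CLUSTER QUEUE PTILE
# Returns 0 if the requested ptile fits on one node, 1 if it does not.
check_ptile() {
    limit=$(max_ptile "$1" "$2") || return 2
    if [ "$3" -le "$limit" ]; then
        echo "ok: ptile=$3 fits (max $limit)"
    else
        echo "too large: ptile=$3 exceeds max $limit; job would stay pending"
        return 1
    fi
}
```

For example, check_ptile ada xlarge 40 reports that the request fits, while check_ptile curie any 20 reports that it is too large.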

Q: Why does my job fail?

A: There can be many reasons why your job fails. ALWAYS check your job (LSF) and program output files for information regarding why your job might have failed.

  • Wrong file format
    • If you edited your file on a Windows computer prior to using it on Ada, your file may be in the wrong format.
    • If you see errors in your output file caused by whitespace characters, your file may be in the wrong format.
    • SOLUTION: Try the dos2unix utility on your file and submit again.
  • Your job ran out of time
    • If you see "TERM_RUNLIMIT" in your job output file, your job ran out of time.
    • SOLUTION: Increase your wall time with "#BSUB -W HH:MM" and submit again.
  • Your job ran out of memory
    • If you see "TERM_MEMLIMIT" in your job output file, your job ran out of memory.
    • SOLUTION: Increase your memory specifications with #BSUB -R "rusage[mem=XX]" and #BSUB -M XX and submit again.
  • You ran out of space
    • If you see "DISK QUOTA EXCEEDED" in your output file, you ran out of disk space.
    • Check your quotas regularly with showquota.
    • SOLUTION: Delete or move files you no longer need from your directories and submit again.
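
The checks in the list above can be scripted. This is a minimal sketch in POSIX shell under stated assumptions: diagnose_job and has_crlf are hypothetical helper names, not HPRC or LSF commands, and the sketch only looks for the exact markers quoted above.

```shell
# Hypothetical helper: scan a job output file for the markers listed
# above and print the matching suggestion.
# Usage: diagnose_job OUTPUT_FILE
diagnose_job() {
    out="$1"
    if grep -q "TERM_RUNLIMIT" "$out"; then
        echo "out of time: raise #BSUB -W HH:MM"
    elif grep -q "TERM_MEMLIMIT" "$out"; then
        echo "out of memory: raise rusage[mem=XX] and #BSUB -M XX"
    elif grep -qi "disk quota exceeded" "$out"; then
        echo "out of disk space: check showquota and clean up"
    else
        echo "no known marker: read the program output"
    fi
}

# Return 0 if the file contains carriage returns (Windows line endings),
# which suggests running dos2unix on it before resubmitting.
# Usage: has_crlf FILE
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}
```

A typical use would be to run diagnose_job on the LSF output file and has_crlf on the job script or input file, then apply the matching SOLUTION from the list.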

Q: How much memory do I need?

Q: How many cores should I use?

Q: How long is my job going to take?

Q: Why is my program slow?

Q: What is "Disk Quota Exceeded"?