Difference between revisions of "HPRC:CommonProblems"

From TAMU HPRC
Revision as of 11:00, 25 July 2016

Common Problems & Quick Solutions

Accounts

Q: When do accounts expire?

A: Accounts expire at the start of the new fiscal year (September 1st). You can see when your account expires by going to our Account Management System (AMS) and checking under the Accounts tab.

Q: How do I get more SUs?

A: Students will need to have their PI transfer SUs to them. PIs can apply for up to two Small accounts for not more than 200,000 collective SUs. After this Small allocation has run out, PIs will need to apply for a Large allocation. See our Account Allocations page for more information on the allocation policies.

Q: How do I transfer SUs?

A: To transfer SUs, PIs will need a Small or Large account (see our Account Allocations page for more information). Once an account has been granted to the PI, they can transfer SUs to any of their researchers on our Account Management System (AMS). If a PI needs to add a new researcher, the PI must contact the Help Desk.

Batch Processing

Q: Why is my job pending?

A: There can be many reasons why a job would be pending:

  • Your job cannot fit on any of our nodes
    • If your job requests more than 245GB of memory without requesting the xlarge queue, it will be stuck pending.
    • SOLUTION: Kill your job and resubmit with less memory or in the xlarge queue. IMPORTANT NOTE: Your program MUST use Westmere-compatible software to be able to run in the xlarge queue.
    • If your job requests more than 2TB of memory, your job will be stuck pending.
    • SOLUTION: Kill your job and resubmit with less memory.
    • If your job asks for more than the maximum number of cores per node (Ada: 20, or 40 with the xlarge queue; Curie: 16) with #BSUB -R "span[ptile=XX]", your job will be stuck pending.
    • SOLUTION: Kill your job and resubmit with a ptile value less than or equal to the maximum value for the cluster.
  • There are no job slots available
    • If your job requires the usage of the 256GB, 1TB, or 2TB nodes, your job might be pending for longer than usual.
    • If the cluster usage is particularly high right now, your job might be pending for longer than usual. You can see the System Load Levels on our Home Page.
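
A job header that stays inside the limits listed above might look like the following minimal sketch. The job name, output file name, core count, memory value, and wall time are placeholders rather than recommended settings, and the memory unit is assumed to be MB, which should be checked against the cluster documentation. Only -W, -M, rusage[mem=XX], and span[ptile=XX] come from this page; -J, -n, -o, and -q are standard LSF options.

```shell
#!/bin/bash
# Hypothetical LSF job script for Ada. All names and values are placeholders.
#BSUB -J example_job             # job name (placeholder)
#BSUB -n 40                      # total number of cores requested
#BSUB -R "span[ptile=20]"        # cores per node: 20 is the Ada maximum outside xlarge
#BSUB -R "rusage[mem=2500]"      # memory reservation (unit assumed to be MB)
#BSUB -M 2500                    # memory limit, kept equal to the reservation
#BSUB -W 02:00                   # wall time limit as HH:MM
#BSUB -o example_job.%J.out      # job output file; LSF replaces %J with the job ID
# To target the large-memory nodes instead, the queue would be selected with -q xlarge.

# Placeholder for the real launch command of your program.
echo "replace this line with the command that starts your program"
```

With 40 cores at 20 per node, this request spans two nodes, so it only makes sense for a program that can run across nodes.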

Q: Why does my job fail?

A: There can be many reasons why your job fails. ALWAYS check your job (LSF) and program output files for information regarding why your job might have failed.

  • Wrong file format
    • If you edited your file on a Windows computer prior to using it on Ada, your file may be in the wrong format.
    • If you see errors in your output file caused by whitespace characters, your file may be in the wrong format.
    • SOLUTION: Try the dos2unix utility on your file and submit again.
  • Your job ran out of time
    • If you see "TERM_RUNLIMIT" in your job output file, your job ran out of time.
    • SOLUTION: Increase your wall time specification #BSUB -W HH:MM and submit again.
  • Your job ran out of memory
    • If you see "TERM_MEMLIMIT" in your job output file, your job ran out of memory.
    • SOLUTION: Increase your memory specifications #BSUB -R rusage[mem=XX] and #BSUB -M XX and submit again.
  • You ran out of space
    • If you see "DISK QUOTA EXCEEDED" in your output file, you ran out of disk space.
    • Remember to check your quotas regularly with showquota.
    • SOLUTION: Clear out your directories and submit again.
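
The checks above can be sketched as two small bash helpers. The function names diagnose_job and has_crlf are hypothetical, and the messages they print are illustrative; the markers they search for are the ones listed above.

```shell
#!/bin/bash
# diagnose_job: scan a job output file for the failure markers named above
# and print a one-line hint. Usage: diagnose_job <output-file>
diagnose_job() {
    local file="$1"
    if grep -q "TERM_RUNLIMIT" "$file"; then
        echo "out of time: raise #BSUB -W"
    elif grep -q "TERM_MEMLIMIT" "$file"; then
        echo "out of memory: raise rusage[mem] and -M"
    elif grep -qi "disk quota exceeded" "$file"; then
        echo "out of disk space: run showquota and clean up"
    else
        echo "no known marker found"
    fi
}

# has_crlf: succeed (exit status 0) if the file contains Windows-style
# CRLF line endings, in which case dos2unix is worth trying.
has_crlf() {
    grep -q $'\r' "$1"
}
```

For example, `has_crlf myjob.sh && dos2unix myjob.sh` converts a script only when it actually contains carriage returns (myjob.sh is a placeholder file name).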

Q: How much memory do I need?

Q: How many cores should I use?

Q: How long is my job going to take?

Q: Why is my program slow?

A: While using one core:

  • Supercomputers ("clusters") are not large single-core entities. A cluster is a collection of CPUs. In Ada's case, the CPUs are mostly 10-core Intel Xeon processors based on the Ivy Bridge microarchitecture, which makes them similar to the processors in most "regular" computers. You should not expect a large single-core performance gain on Ada compared with a "regular" computer. To see a performance gain on Ada, programs and simulations need to be parallelized to run on multiple cores. IMPORTANT NOTE: If your program or simulation is not written to run in parallel, requesting multiple cores will either not work at all or waste SUs.

A: While using multiple cores:

  • If your program or simulation runs particularly slowly on Ada, you may be experiencing parallel slowdown. This happens when the overhead from communication between parallel tasks is greater than the time spent on useful computation. Parallelizing the program further will only slow it down more.
  • SOLUTION: Reduce the amount of parallelization in your program until you find the "sweet spot" in which you have the most significant speed-up. If you cannot achieve any speed-up from parallelization, it is best to run your program serially.
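
One way to look for the "sweet spot" is to run the same input at several core counts, record the wall time of each run, and pick the fastest. The helper below is a minimal sketch of that last step; the function name best_core_count and the "cores seconds" input format are assumptions made for illustration.

```shell
#!/bin/bash
# best_core_count: read lines of the form "<cores> <seconds>" on stdin and
# print the core count that had the shortest wall time.
best_core_count() {
    awk 'NR == 1 || $2 + 0 < best { best = $2 + 0; cores = $1 }
         END { print cores }'
}
```

If the fastest run is the one-core run, that matches the advice above to run the program serially. Because each extra core also costs SUs, a smaller core count that is nearly as fast may be the better choice.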

Q: What is "Disk Quota Exceeded"?

A: This message means that you have reached one or more of your file quotas.

  • Remember to check your quotas regularly with showquota.
  • SOLUTION: Remove any unnecessary files from the directories that are over quota.
  • For more information on filesystems and quotas, please refer to this page.
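
To find which directories are worth cleaning, the standard du, sort, and head utilities can be combined as in the sketch below. The function name largest_entries is hypothetical. The glob skips hidden files and directories, so the listing is a starting point rather than a full account of your usage.

```shell
#!/bin/bash
# largest_entries: list the N largest files or directories directly under DIR,
# biggest first, with sizes in kilobytes.
# Usage: largest_entries [DIR] [N]   (defaults: current directory, 10 entries)
largest_entries() {
    local dir="${1:-.}" n="${2:-10}"
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$n"
}
```

For example, `largest_entries "$HOME" 5` prints the five largest visible entries in your home directory.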