Difference between revisions of "HPRC:CommonProblems"

From TAMU HPRC

Revision as of 14:15, 1 February 2017

Common Problems & Quick Solutions

Accounts

Q: When do accounts expire?

A: Accounts expire at the start of the new fiscal year (September 1st). You can see when your account expires by going to our Account Management System (AMS) and checking under the Accounts tab.

Q: How do I get more SUs?

A: Students will need to have their PI transfer SUs to them. PIs can apply for up to two Startup accounts of up to 200,000 SUs each, for a combined total of no more than 400,000 SUs. After this Startup allocation has run out, PIs will need to apply for a Research allocation. More information on the allocation policies can be found on our Account Allocations page.

Q: I just received my SUs, how can I use them?

A: When you have received your SUs, you will need to either change/set your default account or specify in your job file which account should be used.

  • To change your default account, use the myproject utility on our systems. More information on the myproject utility can be found on our AMS User Interface page.
[NetID@cluster ~]$  myproject -d XXXXXXXXXX
Your default project account now is XXXXXXXXXX.
  • To request a certain account in your job file, add the following line to the directives section of your job file:
#BSUB -P XXXXXXXXXX
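
As a rough illustration, a minimal Ada job file that charges a specific account might look like the sketch below. Only the -P and -W directives appear on this page; the job name, core count, output file name, and the echo command are placeholders chosen for this example and should be adapted to the actual job.

```shell
# Job name (placeholder).
#BSUB -J example_job
# Project account to charge, as described above.
#BSUB -P XXXXXXXXXX
# Wall time limit in HH:MM.
#BSUB -W 01:00
# Number of cores requested (placeholder value).
#BSUB -n 1
# Output file; %J is replaced by the job ID.
#BSUB -o example_job.%J

# The commands of the job follow the directives.
echo "Job started"
```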

Q: How do I transfer SUs?

A: To transfer SUs, PIs will need a Small or Large account (see our Account Allocations page for more information). Once an account has been granted to the PI, they can transfer SUs to any of their researchers on our Account Management System (AMS). If a PI needs to add a new researcher, the PI must contact the Help Desk.

Ada Batch Processing

Q: Why is my job pending?

A: There can be many reasons why a job would be pending:

  • The job cannot fit on any of our nodes
    • If the job requests more than 245GB of memory per node, without requesting the xlarge queue, it will be stuck pending.
    • SOLUTION: Kill the job and resubmit with less memory or in the xlarge queue. IMPORTANT NOTE: The program MUST use Westmere-compatible software to be able to run in the xlarge queue.
    • If the job requests more than 2TB of memory per node, it will be stuck pending.
    • SOLUTION: Kill the job and resubmit with less memory.
    • If the job asks for more than the maximum number of cores per node with #BSUB -R "span[ptile=XX]" it will be stuck pending. On Ada, the maximum number of cores per node is 20 on the regular nodes and 40 on the xlarge nodes.
    • SOLUTION: Kill the job and resubmit with a ptile value less than or equal to the maximum value for the cluster.
  • There are no job slots available
    • If the job requires the usage of the 256GB, 1TB, or 2TB nodes, it might be pending for longer than usual.
    • If the cluster usage is particularly high right now, jobs might be pending for longer than usual. The System Load Levels are available on our Home Page.
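
The limits listed above can be checked before submitting. The following is a minimal sketch, not an official tool: the function name check_ada_request and its arguments are hypothetical, 2TB is assumed to mean 2048GB, and any queue name other than xlarge is treated as running on the regular nodes.

```shell
# check_ada_request MEM_GB PTILE QUEUE
# Prints one warning for each Ada limit from the list above that the request exceeds.
# Assumptions: 2TB is treated as 2048GB; any QUEUE other than "xlarge" means regular nodes.
check_ada_request() {
    mem_gb=$1; ptile=$2; queue=$3
    if [ "$queue" = "xlarge" ]; then max_cores=40; else max_cores=20; fi

    if [ "$mem_gb" -gt 2048 ]; then
        echo "memory: more than 2TB per node, no node can fit this job"
    elif [ "$mem_gb" -gt 245 ] && [ "$queue" != "xlarge" ]; then
        echo "memory: more than 245GB per node requires the xlarge queue"
    fi
    if [ "$ptile" -gt "$max_cores" ]; then
        echo "ptile: $ptile exceeds the $max_cores cores per node in this queue"
    fi
}

# Example: 300GB per node and ptile=20 outside the xlarge queue.
check_ada_request 300 20 regular
```

A request that prints nothing passes these three checks, but it can still pend for the other reasons listed above, such as high cluster usage.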

Q: Why does my job fail?

A: There can be many reasons why a job fails. ALWAYS check the job output file that is created by the batch system and any program output files for information regarding why a job might have failed.

  • Wrong file format
    • If a file has been edited on a Windows computer prior to using it on our clusters, the file may be in the wrong format.
    • TIP: Use the file command to check if the file has CRLF line terminators. If it does, the file is in the wrong format.
    • SOLUTION: Try the dos2unix utility on the file and submit again.
[NetID@cluster ~]$ file myFile
myFile: ASCII English text, with CRLF line terminators
[NetID@cluster ~]$ dos2unix myFile
dos2unix: converting file myFile to UNIX format ...
[NetID@cluster ~]$ file myFile
myFile: ASCII English text
  • The job ran out of time
    • If "TERM_RUNLIMIT" appears in the job output file, the job ran out of time.
    • SOLUTION: Increase the wall time specification #BSUB -W HH:MM and submit again.
  • The job ran out of memory
    • If "TERM_MEMLIMIT" appears in the job output file, the job ran out of memory.
    • SOLUTION: Increase the memory specifications #BSUB -R rusage[mem=XX] and #BSUB -M XX and submit again.
  • Not enough space
    • If "DISK QUOTA EXCEEDED" appears in the output file, there is not enough disk space to complete the job.
    • All users are encouraged to check their quotas regularly with showquota.
    • SOLUTION: See the question below for how to deal with DISK QUOTA EXCEEDED errors.
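
The file format check above can also be scripted, for example to test several job files at once. This is a minimal sketch under stated assumptions: has_crlf and strip_cr are hypothetical helper names, job.sh is a placeholder file name, and strip_cr removes every carriage return in the file, which is cruder than what dos2unix does. Where dos2unix is available, it remains the simpler choice.

```shell
# has_crlf FILE: succeed if FILE contains at least one carriage return.
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# strip_cr FILE: write FILE to standard output with all carriage returns removed.
strip_cr() {
    tr -d '\r' < "$1"
}

# Usage: write a converted copy of job.sh (placeholder name) only if it needs one.
if [ -f job.sh ] && has_crlf job.sh; then
    strip_cr job.sh > job.unix.sh
fi
```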

Q: Why is my program slow?

A: While using one core:

  • Supercomputers ("clusters") are not large single-core entities. A cluster is a collection of CPUs. Each CPU is likely similar to what one would use in most "regular" computers. A huge performance gain should not be expected when using a single core on one of our clusters versus a "regular" computer. In order to see a performance gain, programs and simulations will need to be parallelized to run on multiple cores.

A: While using multiple cores:

  • If a program or simulation is running particularly slowly, it may be experiencing parallel slowdown. This happens when the overhead from communication between cores is greater than the time saved by splitting up the work. Trying to further parallelize the program will continue to slow it down.
  • SOLUTION: Reduce the amount of parallelization until the program reaches its "sweet spot", the core count at which it achieves the most significant speed-up. If no speed-up can be achieved from parallelization, it might be best to run the program serially.
  • IMPORTANT NOTE: If the program or simulation is not written to be parallelized, it will either not work at all or waste SUs.
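
One way to look for the "sweet spot" is to time the same job at a few core counts and compare the speed-up. The sketch below assumes the wall times were measured by hand and that the first row is the single-core run; speedup_table is a hypothetical helper and the numbers in the usage example are made up for illustration.

```shell
# speedup_table: read "CORES SECONDS" rows on standard input and print
# "CORES SPEEDUP EFFICIENCY" for each row.
# Assumption: the first row is the single-core run used as the baseline.
speedup_table() {
    awk 'NR == 1 { base = $2 }
         { speedup = base / $2
           printf "%d %.2f %.2f\n", $1, speedup, speedup / $1 }'
}

# Usage with made-up timings: the 16-core run is slower than the 4-core run.
printf '1 100\n4 50\n16 80\n' | speedup_table
```

A speed-up that falls as cores are added, as in the last row of the made-up data, is the sign of parallel slowdown described above.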

Terra Batch Processing

Q: Why is my job pending?

A: There can be many reasons why a job would be pending:

  • The job would run over the maximum runtime for the queue
    • If a job asks for more than 7 days, the job will remain pending.
    • If a queue was requested in the job file and the requested runtime is longer than the maximum of that queue, the job will remain pending.
    • Queue information, including maximum runtime, can be found on our Terra Batch Processing page.
    • SOLUTION: Kill the job and resubmit with a shorter runtime or in a different queue.
  • There are no job slots available
    • If the job requires the usage of the 128GB (GPU) nodes, it might be pending for longer than usual.
    • If the cluster usage is particularly high right now, jobs might be pending for longer than usual. The System Load Levels are available on our Home Page.
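
The 7-day limit above can be checked before submitting. This is a minimal sketch: to_seconds and exceeds_seven_days are hypothetical helper names, and only the HH:MM:SS and D-HH:MM:SS time formats are handled. Per-queue maximums are not covered, since they are listed on the Terra Batch Processing page rather than here.

```shell
# to_seconds TIME: convert HH:MM:SS or D-HH:MM:SS to a number of seconds.
# Other time formats are not handled in this sketch.
to_seconds() {
    case "$1" in
        *-*) days=${1%%-*}; clock=${1#*-} ;;
        *)   days=0; clock=$1 ;;
    esac
    echo "$days $clock" |
        awk '{ split($2, t, ":"); print $1 * 86400 + t[1] * 3600 + t[2] * 60 + t[3] }'
}

# exceeds_seven_days TIME: succeed if TIME is longer than 7 days.
exceeds_seven_days() {
    [ "$(to_seconds "$1")" -gt $((7 * 86400)) ]
}

# Usage: warn about a requested runtime of 8 days.
if exceeds_seven_days "8-00:00:00"; then
    echo "requested runtime is longer than 7 days; the job would remain pending"
fi
```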

Q: Why does my job fail?

A: There can be many reasons why a job fails. ALWAYS check the job output file that is created by the batch system and any program output files for information regarding why a job might have failed.

  • Wrong file format
    • If a file has been edited on a Windows computer prior to using it on our clusters, the file may be in the wrong format.
    • TIP: Use the file command to check if the file has CRLF line terminators. If it does, the file is in the wrong format.
    • SOLUTION: Try the dos2unix utility on the file and submit again.
[NetID@cluster ~]$ file myFile
myFile: ASCII English text, with CRLF line terminators
[NetID@cluster ~]$ dos2unix myFile
dos2unix: converting file myFile to UNIX format ...
[NetID@cluster ~]$ file myFile
myFile: ASCII English text
  • The job ran out of time
    • If "CANCELLED ... DUE TO TIME LIMIT" appears in the job output file, the job ran out of time.
    • SOLUTION: Increase the wall time specification #SBATCH -t HH:MM:SS and submit again.
  • The job ran out of memory
    • If "CANCELLED ... DUE TO MEMORY LIMIT" appears in the job output file, the job ran out of memory.
    • SOLUTION: Increase the memory specification #SBATCH --mem=XX and submit again.
  • Not enough space
    • If "DISK QUOTA EXCEEDED" appears in the output file, there is not enough disk space to complete the job.
    • All users are encouraged to check their quotas regularly with showquota.
    • SOLUTION: See the question below for how to deal with DISK QUOTA EXCEEDED errors.
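
The markers listed above can be searched for in a job output file with grep. The sketch below is only a convenience wrapper: diagnose_job_output is a hypothetical name, slurm-12345.out is a placeholder file name, and the hints simply repeat the solutions given above.

```shell
# diagnose_job_output FILE: print a hint for each failure marker found in FILE.
diagnose_job_output() {
    if grep -q "DUE TO TIME LIMIT" "$1"; then
        echo "time: increase the wall time with #SBATCH -t HH:MM:SS"
    fi
    if grep -q "DUE TO MEMORY LIMIT" "$1"; then
        echo "memory: increase the memory with #SBATCH --mem=XX"
    fi
    # Matched case-insensitively, since the capitalization of this message can vary.
    if grep -qi "DISK QUOTA EXCEEDED" "$1"; then
        echo "disk: check showquota and free up space"
    fi
}

# Usage: slurm-12345.out is a placeholder output file name.
if [ -f slurm-12345.out ]; then
    diagnose_job_output slurm-12345.out
fi
```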

Q: Why is my program slow?

A: While using one core:

  • Supercomputers ("clusters") are not large single-core entities. A cluster is a collection of CPUs. Each CPU is likely similar to what one would use in most "regular" computers. A huge performance gain should not be expected when using a single core on one of our clusters versus a "regular" computer. In order to see a performance gain, programs and simulations will need to be parallelized to run on multiple cores.

A: While using multiple cores:

  • If a program or simulation is running particularly slowly, it may be experiencing parallel slowdown. This happens when the overhead from communication between cores is greater than the time saved by splitting up the work. Trying to further parallelize the program will continue to slow it down.
  • SOLUTION: Reduce the amount of parallelization until the program reaches its "sweet spot", the core count at which it achieves the most significant speed-up. If no speed-up can be achieved from parallelization, it might be best to run the program serially.
  • IMPORTANT NOTE: If the program or simulation is not written to be parallelized, it will either not work at all or waste SUs.

Software

Q: Is [blank] software installed on the clusters?

A: To see if the software is available on the clusters, use module avail:

[NetID@cluster ~]$ module avail [package name]

This will show a list of all the available software matching this name. The command module spider can also be used to search for available software:

[NetID@cluster ~]$ module spider [package name]

More information on the module system can be found on our Modules page.

Q: How do I load [blank] software?

A: Our clusters use a module system to manage software. This means that to use the software, the proper modules must be loaded first. To load a module, use module load:

[NetID@cluster ~]$ module load [package name]

Note: The full module name, including the version number, is required to load specific modules. Use module spider to find the full module name.
More information on the module system can be found on our Modules page.

Q: How many [blank] licenses are available?

A: Our clusters provide a license status checker tool that shows how many licenses are currently in use and how many are available. To check the license status of a particular software package, use license_status -s:

[NetID@cluster ~]$ license_status -s [package name]

More information on this tool can be found on our License Checker page.

Q: The software I need is not installed, what can I do?

A: If a particular software package is not already installed on the cluster, you can contact us to request its installation. If the software requires a license that we do not already have, you or your department will need to provide your own license to use the software on the cluster. In general, we try to provide as much software as possible for our users. However, this is not always possible, nor is it always possible in a timely manner. If you need software that is not installed on the cluster, you can also install it yourself in your Scratch directory. However, this is only recommended for experienced users.
Note: We are unable to install Windows-only software/packages on our clusters.
Please account for delays in your installation request timeline.

Q: I have a license server for [blank] software, can I use this software on your clusters?

A: If you have a license for a particular software which you would like to use on the clusters, you will need to contact us with that information. We will need the name and version of the software you will be using along with the license file and the host name of your license server.

Q: I do not know how to use [blank] software, can you help me?

A: The software pages in our documentation cover some of our software, but they focus on getting started with running jobs on the cluster rather than on using the software itself. In many cases, we do not have much experience using the software that is provided on our clusters, so it is often best to consult the user guide of the software you are having trouble with. We can always try to provide assistance, but in some cases we will only be able to provide as much help as the user guide for that software provides.

Other

Q: What is "Disk Quota Exceeded"?

A: This message refers to one or more file quotas being reached.

  • Users are advised to check their quotas regularly with showquota.
  • SOLUTION: Clear out the problem directories of any unnecessary files.
  • More information on file systems and quotas can be found on our file systems page.
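
To find out which directories are worth cleaning up, the standard du, sort, and tail utilities can be combined. This is a minimal sketch: largest_entries is a hypothetical helper name, the directory argument is whichever location showquota reports as full, and hidden files and directories are not included in the listing.

```shell
# largest_entries DIR [N]: list the N largest entries directly under DIR
# (default 5), smallest first, with sizes in kilobytes.
# Hidden entries (names starting with a dot) are not included.
largest_entries() {
    du -sk "$1"/* | sort -n | tail -n "${2:-5}"
}

# Usage: show the three largest entries in the current directory.
largest_entries . 3
```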

Q: Why does my program stop after 1 hour on the login nodes?

A: Since the login nodes are resources which are constantly shared by many users, we must enforce limits on computing on the login nodes in order to prevent irresponsible usage. One of these limits is on CPU time. Users are limited to ONE HOUR of CPU time per login session. If you need more than one hour of CPU time, you will need to submit a job to the batch system. More information on batch processing can be found on our batch processing page. You are expected to be responsible and courteous to other users when using software on the login nodes.