
HPRC:CommonProblems

From TAMU HPRC

Common Problems & Quick Solutions

Accounts

Q: When do accounts expire?

A: Accounts expire at the start of the new fiscal year (September 1st). You can see when your account expires by going to our Account Management System (AMS) and checking under the Accounts tab.

Q: How do I get more SUs?

A: Students will need to have their PI transfer SUs to them. PIs can apply for up to two Startup accounts, each for up to 200,000 SUs and for not more than 400,000 collective SUs. After this Startup allocation has run out, PIs will need to apply for a Research allocation. More information on the allocation policies can be found on our Account Allocations page.

Q: I just received my SUs, how can I use them?

A: Once you have received your SUs, you will need to either set or change your default account, or specify in your job file that a particular account be used.

  • To change your default account, use the myproject utility on our systems. More information on the myproject utility can be found on our AMS User Interface page.
[NetID@cluster ~]$  myproject -d XXXXXXXXXX
Your default project account now is XXXXXXXXXX.
  • To request a certain account in your job file, add the following line to the directives section of your job file:

On Ada:

#BSUB -P XXXXXXXXXX

On Terra:

#SBATCH --account=XXXXXXXXXX

Q: How do I set my default account?

A: Use the myproject utility on our systems to set or change your default account. More information on the myproject utility can be found on our AMS User Interface page.

[NetID@cluster ~]$ myproject -d XXXXXXXXXX
Your default project account now is XXXXXXXXXX.

To charge a particular account for a single job instead, add #BSUB -P XXXXXXXXXX (Ada) or #SBATCH --account=XXXXXXXXXX (Terra) to the directives section of your job file.

Q: How do I transfer SUs?

A: To transfer SUs, PIs will need a Small or Large account (see our Account Allocations page for more information). Once an account has been granted to the PI, they can transfer SUs to any of their researchers on our Account Management System (AMS). If a PI needs to add a new researcher, the PI must contact the Help Desk.

Q: How do I get a Guest NetID account for myself or my researchers?

A: Guest NetID accounts are handled by the Identity Management Office. Submit the Guest NetID Account Request Form to that office; the different ways to submit the form are described in its second paragraph.

You will need to specify start and stop affiliation dates on the Guest NetID Account Request Form. These dates may or may not coincide with HPRC account renewal dates, depending on what you list and what gets approved. The Guest NetID Account Request Form is handled by a different department on campus and is separate from the HPRC application, which is the one you need to renew with us each year (September 1 - August 31). You must fill out the Guest NetID Account Request Form prior to applying for an HPRC account.

If the person using the Guest NetID intends to use TAMU Wi-Fi or the TAMU VPN (off-campus access), those resources must be requested on the Guest NetID Account Request Form.

Ada Batch Processing

Q: Why is my job pending?

A: There can be many reasons why a job would be pending:

  • The job cannot fit on any of our nodes
    • If the job requests more than 245GB of memory per node, without requesting the xlarge queue, it will be stuck pending.
    • SOLUTION: Kill the job and resubmit with less memory or in the xlarge queue. IMPORTANT NOTE: The program MUST use Westmere compatible software to be able to run in the xlarge queue.
    • If the job requests more than 2TB of memory per node, it will be stuck pending.
    • SOLUTION: Kill the job and resubmit with less memory.
    • If the job asks for more than the maximum number of cores per node with #BSUB -R "span[ptile=XX]" it will be stuck pending. On Ada, the maximum number of cores per node is 20 on the regular nodes and 40 on the xlarge nodes.
    • SOLUTION: Kill the job and resubmit with a ptile value less than or equal to the maximum value for the cluster.
  • There are no job slots available
    • If the job requires the usage of the 256GB, 1TB, or 2TB nodes, it might be pending for longer than usual.
    • If the cluster usage is particularly high right now, jobs might be pending for longer than usual. The System Load Levels are available on our Home Page.
  • Your job will run into / through a scheduled maintenance time
    • If your job's requested wall time would run into or through a scheduled maintenance window, the job will be stuck pending.
    • SOLUTION: Kill the job and resubmit with a wall time which ends before the scheduled maintenance or resubmit after the maintenance has finished.
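As a sanity check before submitting near a maintenance window, you can compare your requested wall time against the maintenance start time. A minimal sketch with hypothetical dates, using GNU date as found on our Linux clusters:

```shell
#!/bin/sh
# Hedged sketch: would a 72-hour job submitted now run into maintenance?
# All dates below are hypothetical examples.
now=$(date -d '2020-01-10 09:00' +%s)     # submission time (epoch seconds)
wall_hours=72                             # requested wall time
maint=$(date -d '2020-01-12 08:00' +%s)   # scheduled maintenance start
end=$(( now + wall_hours * 3600 ))        # projected job end time
if [ "$end" -gt "$maint" ]; then
  echo "Job would run into maintenance: shorten the wall time or submit afterwards."
fi
```

With these example dates the job would end on January 13, past the January 12 maintenance start, so the warning prints and the job should be shortened or delayed.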

Q: Why does my job fail?

A: There can be many reasons why a job fails. ALWAYS check the job output file that is created by the batch system and any program output files for information regarding why a job might have failed.

  • Wrong file format
    • If a file has been edited on a Windows computer prior to using it on our clusters, the file may be in the wrong format.
    • TIP: Use the file command to check if the file has CRLF line terminators. If it does, the file is in the wrong format.
    • SOLUTION: Try the dos2unix utility on the file and submit again.
[NetID@cluster ~]$ file myFile
myFile: ASCII English text, with CRLF line terminators
[NetID@cluster ~]$ dos2unix myFile
dos2unix: converting file myFile to UNIX format ...
[NetID@cluster ~]$ file myFile
myFile: ASCII English text
  • The job ran out of time
    • If "TERM_RUNLIMIT" appears in the job output file, the job ran out of time.
    • SOLUTION: Increase the wall time specification #BSUB -W HH:MM and submit again.
  • The job ran out of memory
    • If "TERM_MEMLIMIT" appears in the job output file, the job ran out of memory.
    • SOLUTION: Increase the memory specifications #BSUB -R rusage[mem=XX] and #BSUB -M XX and submit again.
  • Not enough space
    • If "DISK QUOTA EXCEEDED" appears in the output file, there is not enough disk space to complete the job.
    • All users are encouraged to check their quotas regularly with showquota.
    • SOLUTION: See the question below for how to deal with DISK QUOTA EXCEEDED errors.
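The failure markers above can be checked for with grep. A minimal sketch, using a fabricated job.12345.out file to stand in for a real LSF job output file:

```shell
#!/bin/sh
# Hedged sketch: classify a failed job from its output file.
# job.12345.out is a fabricated example, not a real cluster file.
cat > job.12345.out <<'EOF'
TERM_RUNLIMIT: job killed after reaching LSF run time limit.
EOF

reason=unknown
if grep -q 'TERM_RUNLIMIT' job.12345.out; then
  reason="wall time: increase #BSUB -W and resubmit"
elif grep -q 'TERM_MEMLIMIT' job.12345.out; then
  reason="memory: increase #BSUB -R rusage[mem=XX] / #BSUB -M XX and resubmit"
elif grep -qi 'disk quota exceeded' job.12345.out; then
  reason="disk quota: clean up files (check showquota)"
fi
echo "Likely failure cause: $reason"
rm job.12345.out
```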

Q: What if I want to run a program interactively? (GUI)

A: Although most computation on our clusters is done non-interactively, we support several options for interactive programming and visualization.

  • Use Open On Demand
    • Open On Demand is a web-based interface for creating, launching, and visualizing jobs on Ada. There are several applications you can launch via Open On Demand such as ABAQUS and MATLAB. You can find Open On Demand at the following URL: https://portal.hprc.tamu.edu
    • You can read more about Open On Demand on the Portal Wiki page: Open On Demand Wiki Page
  • Submit a VNC Job
    • You can submit a VNC job to open a GUI on Ada. There is an in-depth guide on launching a Remote Visualization job on Ada: Ada VNC Guide
  • Run from Login Node with X11 forwarding
    • You can launch the GUI of certain applications from the login nodes. Keep in mind the Acceptable Use Policy while running on the login nodes. These limitations include:
      • ONE HOUR of PROCESSING TIME per login session.
      • EIGHT CORES per login session on the same node or (cumulatively) across all login nodes
    • A detailed guide for launching GUIs from the login nodes can be found at: Access Guide

Q: Why is my program slow?

A: While using one core:

  • Supercomputers ("clusters") are not large single-core entities. A cluster is a collection of CPUs. Each CPU is likely similar to what one would use in most "regular" computers. A huge performance gain should not be expected when using a single core on one of our clusters versus a "regular" computer. In order to see a performance gain, programs and simulations will need to be parallelized to run on multiple cores.

A: While using multiple cores:

  • If a program or simulation is running particularly slowly, it may be experiencing parallel slowdown. This happens when the overhead from communication is greater than the time spent running a program. Trying to further parallelize the program will continue to slow it down.
  • SOLUTION: Reduce the amount of parallelization until the program reaches its "sweet spot", where it achieves the most significant speed-up. If no speed-up can be achieved from parallelization, it might be best to run the program serially.
  • IMPORTANT NOTE: If the program or simulation is not written to be parallelized, it will either not work at all or waste SUs.

Terra Batch Processing

Q: Why is my job pending?

A: There can be many reasons why a job would be pending:

  • The job would run over the maximum runtime for the queue
    • If a job asks for more than 7 days, the job will remain pending.
    • If a queue was requested in the job file and the requested runtime is longer than the maximum of that queue, the job will remain pending.
    • Queue information, including maximum runtime, can be found on our Terra Batch Processing page.
    • SOLUTION: Kill the job and resubmit with a shorter runtime or in a different queue.
  • There are no job slots available
    • If the job requires the usage of the 128GB (GPU) nodes, it might be pending for longer than usual.
    • If the cluster usage is particularly high right now, jobs might be pending for longer than usual. The System Load Levels are available on our Home Page.
  • Your job will run into / through a scheduled maintenance time
    • If your job's requested wall time would run into or through a scheduled maintenance window, the job will be stuck pending.
    • SOLUTION: Kill the job and resubmit with a wall time which ends before the scheduled maintenance or resubmit after the maintenance has finished.
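Before submitting, you can sanity-check a requested runtime against the 7-day maximum mentioned above. A minimal sketch with an example request:

```shell
#!/bin/sh
# Hedged sketch: will this runtime request pend forever?
# requested_hours is a hypothetical example value.
requested_hours=200          # the runtime you intend to request
max_hours=$(( 7 * 24 ))      # 7-day maximum from the queue policy above
if [ "$requested_hours" -gt "$max_hours" ]; then
  echo "Runtime exceeds the 7-day limit; shorten it or choose another queue."
fi
```

Here 200 hours exceeds the 168-hour (7-day) maximum, so the job would remain pending until killed and resubmitted with a shorter runtime.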

Q: Why does my job fail?

A: There can be many reasons why a job fails. ALWAYS check the job output file that is created by the batch system and any program output files for information regarding why a job might have failed.

  • Wrong file format
    • If a file has been edited on a Windows computer prior to using it on our clusters, the file may be in the wrong format.
    • TIP: Use the file command to check if the file has CRLF line terminators. If it does, the file is in the wrong format.
    • SOLUTION: Try the dos2unix utility on the file and submit again.
[NetID@cluster ~]$ file myFile
myFile: ASCII English text, with CRLF line terminators
[NetID@cluster ~]$ dos2unix myFile
dos2unix: converting file myFile to UNIX format ...
[NetID@cluster ~]$ file myFile
myFile: ASCII English text
  • The job ran out of time
    • If "CANCELLED ... DUE TO TIME LIMIT" appears in the job output file, the job ran out of time.
    • SOLUTION: Increase the wall time specification #SBATCH -t HH:MM:SS and submit again.
  • The job ran out of memory
    • If "CANCELLED ... DUE TO MEMORY LIMIT" appears in the job output file, the job ran out of memory.
    • SOLUTION: Increase the memory specification #SBATCH --mem=XX and submit again.
  • Not enough space
    • If "DISK QUOTA EXCEEDED" appears in the output file, there is not enough disk space to complete the job.
    • All users are encouraged to check their quotas regularly with showquota.
    • SOLUTION: See the question below for how to deal with DISK QUOTA EXCEEDED errors.
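As on Ada, the Slurm markers above can be grepped out of the job output file. A sketch using a fabricated slurm-12345.out file as a stand-in:

```shell
#!/bin/sh
# Hedged sketch: classify a failed Terra job from its output file.
# slurm-12345.out and its contents are fabricated examples.
cat > slurm-12345.out <<'EOF'
slurmstepd: error: *** JOB 12345 CANCELLED AT 2020-01-10 DUE TO TIME LIMIT ***
EOF

reason=unknown
if grep -q 'DUE TO TIME LIMIT' slurm-12345.out; then
  reason="wall time: increase #SBATCH -t and resubmit"
elif grep -q 'DUE TO MEMORY LIMIT' slurm-12345.out; then
  reason="memory: increase #SBATCH --mem and resubmit"
elif grep -qi 'disk quota exceeded' slurm-12345.out; then
  reason="disk quota: clean up files (check showquota)"
fi
echo "Likely failure cause: $reason"
rm slurm-12345.out
```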

Q: Why is my program slow?

A: While using one core:

  • Supercomputers ("clusters") are not large single-core entities. A cluster is a collection of CPUs. Each CPU is likely similar to what one would use in most "regular" computers. A huge performance gain should not be expected when using a single core on one of our clusters versus a "regular" computer. In order to see a performance gain, programs and simulations will need to be parallelized to run on multiple cores.

A: While using multiple cores:

  • If a program or simulation is running particularly slowly, it may be experiencing parallel slowdown. This happens when the overhead from communication is greater than the time spent running a program. Trying to further parallelize the program will continue to slow it down.
  • SOLUTION: Reduce the amount of parallelization until the program reaches its "sweet spot", where it achieves the most significant speed-up. If no speed-up can be achieved from parallelization, it might be best to run the program serially.
  • IMPORTANT NOTE: If the program or simulation is not written to be parallelized, it will either not work at all or waste SUs.

Software

Q: Is [blank] software installed on the clusters?

A: To see if the software is available on the clusters, use module avail:

[NetID@cluster ~]$ module avail [package name]

This will show a list of all the available software matching this name. The command module spider can be used to search for available software:

[NetID@cluster ~]$ module spider [package name]

More information on the module system can be found on our Modules page.

Q: How do I load [blank] software?

A: Our clusters use a module system to manage software. This means that to use the software, the proper modules must be loaded first. To load a module, use module load:

[NetID@cluster ~]$ module load [package name]

Note: The full module name, including the version number, is required to load specific modules. Use module spider to find the full module name.
More information on the module system can be found on our Modules page.

Q: How many [blank] licenses are available?

A: Our clusters provide a license status checker tool that shows how many licenses are currently in use and how many are available. To check the license status of a particular software package, use license_status -s:

[NetID@cluster ~]$ license_status -s [package name]

More information for this tool can be found on our License Checker page.

Q: The software I need is not installed, what can I do?

A: If a particular software package is not already installed on the cluster, you can contact us about installing it. If the software requires a license which we do not already have, you or your department will need to provide your own license in order to use the software on the cluster. In general, we try to provide as much software as possible for our users, but this is not always possible, nor always possible in a timely manner, so please account for delays in your installation request timeline. You can also install software yourself in your scratch directory, although this is only recommended for experienced users.
Note: We are unable to install Windows-only software/packages on our clusters.

Q: I have a license server for [blank] software, can I use this software on your clusters?

A: If you have a license for a particular software which you would like to use on the clusters, you will need to contact us with that information. We will need the name and version of the software you will be using along with the license file and the host name of your license server.

Q: I do not know how to use [blank] software, can you help me?

A: We have documentation on the software page for some of our software; however, these pages focus on getting jobs running on the cluster rather than on using the software itself. In many cases we have limited experience with the software provided on our clusters, so it is often best to consult a package's own user guide if you are having trouble. We can always try to provide assistance, but in some cases we can only offer as much help as that user guide provides.

Other

Q: What is "Disk Quota Exceeded"?

A: This message refers to one or more file quotas being reached.

  • Users are advised to check their quotas regularly with showquota.
  • SOLUTION: Clear out the problem directories of any unnecessary files.
  • More information on file systems and quotas can be found on our file systems page.

Extra Tips:

  • Some files may be hidden or stored deep within your subdirectories.
  • Hidden files can be seen with the ls -la or tree -a commands.
  • The following command will show you the number of files within each top directory: du -a | sed '/.*\.\/.*\/.*/!d' | cut -d/ -f2 | sort | uniq -c | sort -nr
  • The following command will count the files and directories under the current directory: find . | wc -l
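A variant of the counting pipelines above using find, demonstrated on a small throwaway directory tree so the counts are easy to verify:

```shell
#!/bin/sh
# Hedged sketch: count files per top-level directory with find instead of du.
# quota_demo and its contents are a throwaway example tree.
mkdir -p quota_demo/a quota_demo/b
touch quota_demo/a/f1 quota_demo/a/f2 quota_demo/b/f1

# Files per top-level directory (analogous to the du pipeline above).
counts=$(cd quota_demo && find . -mindepth 2 -type f | cut -d/ -f2 | sort | uniq -c | sort -nr)
echo "$counts"

# Total number of files under the tree.
total=$(find quota_demo -type f | wc -l)
echo "total files: $total"

rm -r quota_demo
```

With the example tree this reports 2 files under a, 1 under b, and 3 in total; run against your scratch or home directory, the same pipeline points at the directories holding the most files.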

Q: Why does my program stop after 1 hour on the login nodes?

A: Since the login nodes are resources which are constantly shared by many users, we must enforce limits on computing on the login nodes in order to prevent irresponsible usage. One of these limits is on CPU time. Users are limited to ONE HOUR of CPU time per login session. If you need more than one hour of CPU time, you will need to submit a job to the batch system. More information on batch processing can be found on our batch processing page. You are expected to be responsible and courteous to other users when using software on the login nodes.
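Work needing more than one hour of CPU time belongs in a batch job. A minimal sketch of a job file, shown here for Terra (Slurm); the job name and account are placeholders, and the resource values are example numbers only:

```shell
#!/bin/bash
#SBATCH --job-name=example      # hypothetical job name
#SBATCH --time=02:00:00         # wall time beyond the one-hour login limit
#SBATCH --ntasks=1              # example: a single-core job
#SBATCH --mem=2G                # example memory request
#SBATCH --account=XXXXXXXXXX    # your project account (placeholder)

# Everything below runs on a compute node, where the login-node
# CPU-time limit does not apply.
node=$(hostname)
echo "Running on $node"
```

Submit it with sbatch; on Ada the equivalent would use #BSUB directives and bsub, as described on our batch processing page.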

Q: How do I acquire an HPRC account for a Texas A&M credit-bearing course?

A: Details for creating HPRC accounts for use in connection with a Texas A&M University credit-bearing course can be found on our wiki: Hosting a Class with HPRC

Q: How can I add output to my .bashrc without breaking anything?

A: Avoid modifying your .bashrc if at all possible. However, if you must add output to your .bashrc that prints every time you log into a machine, wrap it in a test so that it only runs for interactive SSH sessions (output from .bashrc can break non-interactive connections such as scp):

if [ -n "$SSH_TTY" ]
then
  # Put output here (interactive SSH sessions only)
fi