HPRC:CommonProblems
Common Problems & Quick Solutions
Accounts
Q: When do accounts expire?
A: Accounts expire at the start of the new fiscal year (September 1st). You can see when your account expires by going to our Account Management System (AMS) and checking under the Accounts tab.
Q: How do I get more SUs?
A: Students will need to have their PI transfer SUs to them. PIs can apply for up to two Startup accounts, each for up to 200,000 SUs and for not more than 400,000 collective SUs. After this Startup allocation has run out, PIs will need to apply for a Research allocation. More information on the allocation policies can be found on our Account Allocations page.
Q: I just received my SUs, how can I use them?
A: When you have received your SUs, you will need to either change/set your default account or request in your job file that a certain account will be used.
- To change your default account, use the myproject utility on our systems. More information on the myproject utility can be found on our AMS User Interface page. See also the myproject page.
[NetID@cluster ~]$ myproject -d XXXXXXXXXX
Your default project account now is XXXXXXXXXX.
- To request a certain account in your job file, add the following line to the directives section of your job file:
#SBATCH --account=XXXXXXXXXX
Q: How do I set my default account?
A: When you have received your SUs, you will need to either change/set your default account or request in your job file that a certain account will be used.
- To change your default account, use the myproject utility on our systems. More information on the myproject utility can be found on our AMS User Interface page. See also the myproject page.
[NetID@cluster ~]$ myproject -d XXXXXXXXXX
Your default project account now is XXXXXXXXXX.
- To request a certain account in your job file, add the following line to the directives section of your job file:
#SBATCH --account=XXXXXXXXXX
Q: How do I transfer SUs?
A: To transfer SUs, PIs will need a Startup or Research allocation (see our Account Allocations page for more information). Once an account has been granted to the PI, they can transfer SUs to any of their researchers on our Account Management System (AMS). If a PI needs to add a new researcher, the PI must contact the Help Desk.
Q: How long will I be able to access HPRC after losing status as a TAMU student/employee?
A: All critical data should be backed up prior to your TAMU status transitioning in any way. As soon as a system member's status switches to inactive, their NetID is locked. Former employees may be extended the professional courtesy of extended NetID use for up to one year for valid purposes. In the case where extended use of a NetID account is warranted, a sponsor (such as the former employee's department, or former student's professor) must submit a NetID request form. This process is detailed below.
Q: How do I get a Guest NetID account for myself or my researchers?
A: Guest NetID accounts are handled by the Identity Management Office. The Guest NetID Account Request Form should be submitted to that office; the second paragraph of the form describes the different ways you can submit it.
You will need to specify start and stop affiliation dates on the Guest NetID Account Request Form. These dates may or may not coincide with HPRC account renewal dates, depending on what you list and what gets approved. The Guest NetID Account Request Form is handled by a different department on campus and is separate from the HPRC application. The HPRC application is the one you need to renew with us each year (September 1 - August 31). You must fill out the Guest NetID Account Request Form prior to applying for an HPRC account.
If the person using the Guest NetID intends to use TAMU Wi-Fi or the TAMU VPN (off campus access), those resources must be requested on the Guest NetID Account Request Form.
More information regarding Extended Account Access
For more information regarding the extension of a NetID, please refer to this document. Information regarding Extended Account Access can be found on page 13.
Batch Processing
Q: Why is my job pending?
A: There can be many reasons why a job would be pending:
- The job would run over the maximum runtime for the queue
- If a job asks for more than 7 days, the job will remain pending.
- If a queue was requested in the job file and the requested runtime is longer than the maximum of that queue, the job will remain pending.
- Queue information, including maximum runtime, can be found on our Terra, Grace, or FASTER batch processing pages.
- SOLUTION: Kill the job and resubmit with a shorter runtime or in a different queue.
- There are no job slots available
- If the job requires the usage of GPU nodes, it might be pending for longer than usual.
- If the cluster usage is particularly high right now, jobs might be pending for longer than usual. The System Load Levels are available on our Home Page.
- Your job will run into / through a scheduled maintenance time. Check the HPRC web site for any scheduled maintenance.
- If your job's wall time schedules your job into / through a scheduled maintenance it will be stuck pending.
- SOLUTION: Kill the job and resubmit with a wall time which ends before the scheduled maintenance or resubmit after the maintenance has finished.
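The maintenance check above is simple arithmetic: a job fits if its start time plus its wall time ends no later than the start of the maintenance window. The sketch below illustrates this under stated assumptions: the function names are hypothetical, the maintenance time is a made-up value, and only the HH:MM:SS wall time format is handled (not D-HH:MM:SS).

```shell
# Convert an HH:MM:SS wall time to seconds.
# awk reads fields such as "08" as decimal numbers, so leading zeros are safe.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed (exit status 0) if a job starting at start_epoch with the given
# wall time would end at or before maint_epoch. Both are Unix epoch seconds.
fits_before_maintenance() {
    start_epoch=$1
    walltime=$2
    maint_epoch=$3
    end_epoch=$(( start_epoch + $(walltime_to_seconds "$walltime") ))
    [ "$end_epoch" -le "$maint_epoch" ]
}

# Example with a made-up maintenance window that begins 48 hours from now.
now=$(date +%s)
maint=$(( now + 48 * 3600 ))
for wt in 24:00:00 72:00:00; do
    if fits_before_maintenance "$now" "$wt" "$maint"; then
        echo "$wt: ends before the maintenance"
    else
        echo "$wt: overlaps the maintenance; shorten the wall time or resubmit later"
    fi
done
```

The check assumes the job starts immediately. A job that waits in the queue starts later, so in practice it needs extra margin.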
Q: Why does my job fail?
A: There can be many reasons why a job fails. ALWAYS check the job output file that is created by the batch system and any program output files for information regarding why a job might have failed.
- Wrong file format
- If a file has been edited on a Windows computer prior to using it on our clusters, the file may be in the wrong format.
- TIP: Use the file command to check if the file has CRLF line terminators. If it does, the file is in the wrong format.
- SOLUTION: Try the dos2unix utility on the file and submit again.
[NetID@cluster ~]$ file myFile
myFile: ASCII English text, with CRLF line terminators
[NetID@cluster ~]$ dos2unix myFile
dos2unix: converting file myFile to UNIX format ...
[NetID@cluster ~]$ file myFile
myFile: ASCII English text
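The same check can be scripted, for example to test several input files before submitting a job. The sketch below is one possible approach, not an HPRC tool: has_crlf and strip_cr are hypothetical helper names, and strip_cr uses tr as a fallback for machines where dos2unix is not installed. Note that tr removes every carriage return in the file, not only those at line ends.

```shell
# Succeed (exit status 0) if the file contains at least one carriage return.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Write a copy of the input file ($1) to the output file ($2)
# with all carriage returns removed.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Demo on a throwaway file that has Windows line endings.
tmp=$(mktemp)
printf 'line one\r\nline two\r\n' > "$tmp"
if has_crlf "$tmp"; then
    strip_cr "$tmp" "$tmp.unix"
    echo "converted $tmp -> $tmp.unix"
fi
```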
- The job ran out of time
- If "CANCELLED ... DUE TO TIME LIMIT" appears in the job output file, the job ran out of time.
- SOLUTION: Increase the wall time specification #SBATCH -t HH:MM:SS and submit again.
- The job ran out of memory
- If "CANCELLED ... DUE TO MEMORY LIMIT" appears in the job output file, the job ran out of memory.
- SOLUTION: Increase the memory specification #SBATCH --mem=XX and submit again.
- Not enough space
- If "DISK QUOTA EXCEEDED" appears in the output file, there is not enough disk space to complete the job.
- All users are encouraged to check their quotas regularly with showquota.
- SOLUTION: See the question below for how to deal with DISK QUOTA EXCEEDED errors.
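The checks above can be combined into a quick scan of a job output file. The following is a minimal sketch: diagnose_job_output is a hypothetical helper, the search strings are the ones quoted in the list above (the exact wording may differ between scheduler versions), and the sample log line is made up for the demo.

```shell
# Scan a job output file for the failure messages described above and
# print a hint for each one found. Matching is case-insensitive.
diagnose_job_output() {
    out_file=$1
    found=0
    if grep -qi 'DUE TO TIME LIMIT' "$out_file"; then
        echo "time limit: increase the #SBATCH -t value"
        found=1
    fi
    if grep -qi 'DUE TO MEMORY LIMIT' "$out_file"; then
        echo "memory limit: increase the #SBATCH --mem value"
        found=1
    fi
    if grep -qi 'DISK QUOTA EXCEEDED' "$out_file"; then
        echo "disk quota: run showquota and remove unneeded files"
        found=1
    fi
    if [ "$found" -eq 0 ]; then
        echo "no known failure message found"
    fi
}

# Demo with a made-up output file.
sample=$(mktemp)
printf 'step 1 done\n*** JOB 12345 CANCELLED AT 2024-01-01T00:00:00 DUE TO TIME LIMIT ***\n' > "$sample"
diagnose_job_output "$sample"
```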
Q: Why is my job unable to reach the Internet?
A: For policy reasons, we do not allow cluster compute nodes to communicate with the Internet. If you need to download something from the Internet (such as downloading software or cloning a git repo), you must do so from a cluster login node.
Q: What if I want to run a program interactively? (GUI)
A: Although most computation on our clusters is done non-interactively, we support several options for interactive programming and visualization.
- Use Open OnDemand (Recommended Method)
- Open OnDemand is a web-based interface for creating, launching, and visualizing jobs on HPC systems. There are several applications you can launch via Open OnDemand such as ABAQUS, MATLAB, and RStudio. You can find Open OnDemand at the following URL: https://portal.hprc.tamu.edu
- You can read more about Open OnDemand on the Portal Wiki page: Open OnDemand Wiki Page
- Submit a VNC Job
- You can submit a VNC job to open a GUI on Terra, Grace, or FASTER. There is an in-depth guide on launching Remote Visualization jobs here.
- Run from Login Node with X11 forwarding
- You can launch the GUI of certain applications from the login nodes. Keep in mind the Acceptable Use Policy while running on the login nodes. These limitations include:
- ONE HOUR of PROCESSING TIME per login session.
- EIGHT CORES per login session on the same node or (cumulatively) across all login nodes
- A detailed guide for launching GUIs from the login nodes can be found at: Access Guide
Q: Why is my program slow?
A: While using one core:
- Supercomputers ("clusters") are not large single-core entities. A cluster is a collection of CPUs. Each CPU is likely similar to what one would use in most "regular" computers. A huge performance gain should not be expected when using a single core on one of our clusters versus a "regular" computer. In order to see a performance gain, programs and simulations will need to be parallelized to run on multiple cores.
A: While using multiple cores:
- If a program or simulation is running particularly slowly, it may be experiencing parallel slowdown. This happens when the overhead from communication between processes outweighs the time saved by dividing the work. Parallelizing the program further will only slow it down more.
- SOLUTION: Reduce the number of cores until you find the program's "sweet spot", where the speed-up is greatest. If no speed-up can be achieved from parallelization, it might be best to run the program serially.
- IMPORTANT NOTE: If the program or simulation is not written to run in parallel, requesting multiple cores will either not work at all or simply waste SUs.
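One way to locate the sweet spot is to time the same problem at several core counts and compare speed-up (single-core time divided by n-core time) and efficiency (speed-up divided by n). The sketch below uses these standard definitions; the helper names are hypothetical and the timings are made-up numbers for illustration, not measurements from our clusters.

```shell
# speedup T1 Tn: ratio of single-core time to n-core time.
speedup() {
    awk -v t1="$1" -v tn="$2" 'BEGIN { printf "%.2f\n", t1 / tn }'
}

# efficiency T1 Tn N: speed-up divided by the number of cores.
efficiency() {
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN { printf "%.2f\n", t1 / tn / n }'
}

# Made-up wall-clock times in seconds; the single-core run takes 1000 s.
while read -r cores seconds; do
    echo "$cores cores: speed-up $(speedup 1000 "$seconds"), efficiency $(efficiency 1000 "$seconds" "$cores")"
done <<'EOF'
1 1000
2 520
4 290
8 210
16 260
EOF
```

In this made-up data set the 16-core run is slower than the 8-core run, which is the parallel slowdown described above. Since SUs are charged for the cores requested, a core count with low efficiency spends more SUs for little or no gain.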
Software
Q: Is [blank] software installed on the clusters?
A: To see if the software is available on the clusters, use module avail:
[NetID@cluster ~]$ module avail [package name]
This will show a list of all the available software matching this name. The command module spider can be used to search for available software:
[NetID@cluster ~]$ module spider [package name]
More information on the module system can be found on our Modules page.
Q: How do I load [blank] software?
A: Our clusters use a module system to manage software. This means that to use the software, the proper modules must be loaded first. To load a module, use module load:
[NetID@cluster ~]$ module load [package name]
Note: The full module name, including the version number, is required to load specific modules. Use module spider to find the full module name.
More information on the module system can be found on our Modules page.
Q: How many [blank] licenses are available?
A: On our clusters we have a license status checker tool in order to see how many licenses are currently in use and how many are available. To check the license status of a certain software, use license_status -s:
[NetID@cluster ~]$ license_status -s [package name]
More information for this tool can be found on our License Checker page.
Q: The software I need is not installed, what can I do?
A: If a particular software package is not already installed on the cluster, you can contact us to request its installation. If the software requires a license which we do not already have, you or your department will need to provide your own license to be able to use the software on the cluster. In general we try to provide as much software as possible for our users. However, this is not always possible, nor is it always possible in a timely manner. If you need software that is not installed on the cluster, you can also install it yourself in your scratch directory. However, this is only recommended for experienced users.
Note: We are unable to install Windows only software/packages on our clusters.
Please account for delays in your installation request timeline.
Q: I have a license server for [blank] software, can I use this software on your clusters?
A: If you have a license for a particular software which you would like to use on the clusters, you will need to contact us with that information. We will need the name and version of the software you will be using along with the license file and the host name of your license server.
Q: I do not know how to use [blank] software, can you help me?
A: The software page has documentation for some of our software, but it focuses on getting started with running jobs on the cluster rather than on using the software itself. In many cases we do not have much experience with the software provided on our clusters, so it is often best to consult the user guide for that software first. We can always try to provide assistance, but in some cases we will only be able to provide as much help as the user guide does.
Other
Q: "What is 'Disk Quota Exceeded'?"
A: This message refers to one or more file quotas being reached.
- Users are advised to check their quotas regularly with showquota.
- SOLUTION: Clear out the problem directories of any unnecessary files.
- More information on file systems and quotas can be found on our Terra, Grace, and FASTER file system pages.
Extra Tips:
- Some files may be hidden or stored deep within your subdirectories.
- Hidden files can be seen with the ls -la or tree -a commands.
- The following commands will show you the number of files within each top directory:
% ml mpifileutils/0.9.1-intel-2019a
% dwalk mydir
[2020-06-03T17:33:49] Walking /scratch/user/netid/mydir
[2020-06-03T17:33:49] Walked 896 items in 0.015395 seconds (58200.092672 files/sec)
[2020-06-03T17:33:49] Items: 896
[2020-06-03T17:33:49] Directories: 12
[2020-06-03T17:33:49] Files: 884
[2020-06-03T17:33:49] Links: 0
[2020-06-03T17:33:49] Data: 115.775 GB (134.110 MB per file)
- The following command will show you the number of files within the current directory (first column, ignore second and third): find . | wc
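If mpifileutils is not available, a per-directory file count can also be approximated with find alone. The sketch below is one possible approach: count_files_per_dir is a hypothetical helper name, it counts regular files only, and it includes hidden subdirectories. File names containing newlines would be miscounted.

```shell
# Print "<file count> <directory>" for every immediate subdirectory
# (hidden ones included) of the given directory, largest count first.
count_files_per_dir() {
    top=${1:-.}
    for d in "$top"/*/ "$top"/.[!.]*/; do
        [ -d "$d" ] || continue        # skip patterns that matched nothing
        n=$(find "$d" -type f | wc -l | tr -d ' ')
        echo "$n ${d%/}"
    done | sort -rn
}

# Demo on a throwaway directory tree.
demo=$(mktemp -d)
mkdir -p "$demo/results/run1" "$demo/logs" "$demo/.cache"
touch "$demo/results/a" "$demo/results/run1/b" "$demo/results/run1/c"
touch "$demo/logs/x" "$demo/.cache/y"
count_files_per_dir "$demo"
```

On a large scratch directory this walks every file, so it can take a while; run it on one subtree at a time if needed.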
Q: "Why does my program stop after 1 hour on the login nodes?"
A: Since the login nodes are resources which are constantly shared by many users, we must enforce limits on computing on the login nodes in order to prevent irresponsible usage. One of these limits is on CPU time. Users are limited to ONE HOUR of CPU time per login session. If you need more than one hour of CPU time, you will need to submit a job to the batch system. More information on batch processing can be found on our batch processing page. You are expected to be responsible and courteous to other users when using software on the login nodes.
Q: "How do I acquire an HPRC account for a Texas A&M credit-bearing course?"
A: Details for creating HPRC accounts for use in connection with a Texas A&M University credit-bearing course can be found on our wiki: Hosting a Class with HPRC
Q: "How can I add output to my .bashrc without breaking anything?"
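A: Avoid modifying your .bashrc if at all possible. However, if you must add output to your .bashrc to print every time you log into a machine, wrap it in the following conditional so that it only runs when you log in through SSH with a terminal:
if [ "$SSH_TTY" ]
then
    # Put output here, for example:
    echo "Logged in to $(hostname)"
fi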
Q: "How do I set up two-factor authentication?"
A: In order to set up two-factor authentication (needed to access the TAMU VPN and log into CAS), go to https://duo.tamu.edu/. There will be an option to "Enroll/Manage Devices". Clicking this option will bring you to the CAS login page, where you enter your NetID and password. Next, you will add a device to use for authentication. Enter the requested information. Download the Duo Mobile app (Google Play/Apple App Store) on your registered device and log in. Now, whenever you log into a device through CAS, a request will be sent to your registered device asking for approval of the log in.
Q: "My account is locked after failed login attempts (incorrect password, Duo-related problem), what can I do?"
A: Your account will be locked after 7 failed password attempts. It will be unlocked after 10 minutes. You can also specify a different login node. Please be aware that certain MobaXterm setups might cause problems when Duo authentication is enabled. Please refer to the Two Factor Authentication wiki page for more information.
Q: "Application xyz failed to connect to the cluster, what can I do?"
A: Starting November 4th, 2019, Duo authentication is required for every SSH login request. This has been known to cause issues with certain applications. As of this writing, we don't have a complete list of which applications have issues when Duo authentication is enabled. Please send us an email at help@hprc.tamu.edu when you encounter login-related issues with the software you are using. We will try our best to assist you.
Q: "How do I unsubscribe from the TAMU HPRC email list?"
A: As of right now, the only way to unsubscribe from our mailing list is to deactivate your HPRC account.