Grace Quick Start Guide

Deployment Status

Cluster deployed, currently in testing and early user access mode.

Grace Usage Policies

Access to Grace is granted with the condition that you will understand and adhere to all TAMU HPRC and Grace-specific policies.

General policies can be found on the HPRC Policies page.

Accessing Grace

Most access to Grace is done via a secure shell session. In addition, two-factor authentication is required to login to any cluster.

Users on Windows computers use either PuTTY or MobaXterm. If MobaXterm works on your computer, it is usually easier to use. When starting an ssh session in PuTTY, choose the connection type 'SSH', select port 22, and then type the hostname 'grace.hprc.tamu.edu'. For MobaXterm, select 'Session', 'SSH', and then remote host 'grace.hprc.tamu.edu'. Check the box to specify username and type your NetID. After selecting 'Ok', you will be prompted for Duo Two Factor Authentication. For more detailed instructions, visit the Two Factor Authentication page.

Users on Mac and Linux/Unix should use whatever SSH-capable terminal is available on their system. The command to connect to Grace is as follows. Be sure to replace [NetID] with your TAMU NetID.

[user1@localhost ~]$ ssh [NetID]@grace.hprc.tamu.edu

Note: In this example [user1@localhost ~]$ represents the command prompt on your local machine.
Your login password is the same as the one you use on Howdy. You will not see your password as you type it into the login prompt.

Off Campus Access

Please visit this page to find information on accessing Grace remotely.

For more detailed instructions on how to access our systems, please see the HPRC Access page.


Navigating Grace & Storage Quotas

When you first access Grace, you will be placed in your home directory. This directory has a smaller storage quota and should not be used for general-purpose work.

You can navigate to your home directory with the following command:

[NetID@grace1 ~]$ cd /home/NetID

Your scratch directory has more storage space than your home directory and is recommended for general purpose use. You can navigate to your scratch directory with the following command:

[NetID@grace1 ~]$ cd /scratch/user/NetID

You can navigate to scratch or home easily by using their respective environment variables.

Navigate to scratch with the following command:

[NetID@grace1 ~]$ cd $SCRATCH

Navigate to home with the following command:

[NetID@grace1 ~]$ cd $HOME

Your scratch directory is restricted to 1TB/250,000 files of storage. This storage quota is expandable upon request. A user's scratch directory is NOT backed up.

Your home directory is restricted to 10GB/10,000 files of storage. This storage quota is not expandable. A user's home directory is backed up on a nightly basis.

You can see the current status of your storage quotas with:

[NetID@grace1 ~]$ showquota

If you need a storage quota increase, please contact us with justification and the expected length of time that you will need the quota increase.

Transferring Files

Files can be transferred to Grace using the scp command or a file transfer program.

Our users most commonly utilize:

* WinSCP - Straightforward, legacy
* FileZilla Client - Easy to use, additional features, available on most platforms
* MobaXterm Graphical SFTP - Included with MobaXterm

Advice: while GUIs are acceptable for file transfers, the cp and scp commands are much quicker and may significantly benefit your workflow.
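For example, assuming a local file named myfile.txt (the name is just a placeholder), an scp upload from your machine into your Grace scratch directory would look like:

[user1@localhost ~]$ scp myfile.txt NetID@grace.hprc.tamu.edu:/scratch/user/NetID/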

Reliably Transferring Large Files

For files larger than several GB, consider using a more fault-tolerant utility such as rsync.

[NetID@grace1 ~]$ rsync -av [-z] localdir/ userid@remotesystem:/path/to/remotedir/
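As a concrete sketch (the directory name mydata is just a placeholder), pushing a local directory into your Grace scratch space might look like:

[user1@localhost ~]$ rsync -av mydata/ NetID@grace.hprc.tamu.edu:/scratch/user/NetID/mydata/

If the transfer is interrupted, rerunning the same command picks up where it left off by skipping files that have already been copied; adding -z compresses data in transit, which can help on slower connections.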


Managing Project Accounts

The batch system will charge SUs from either the account specified in the job parameters or from your default account (if this parameter is omitted). To avoid errors in SU billing, you can view your active accounts and set your default account using the myproject command.
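As a minimal sketch, running the utility with no arguments displays your active accounts (the exact output format, and the flag used to change your default account, may differ; check the HPRC wiki for details):

[NetID@grace1 ~]$ myproject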

Finding Software

Software on Grace is loaded using hierarchical modules.

A list of the most popular software on our systems is available on the HPRC Available Software page.

To list all software installed as a module on Grace, use the mla utility:

[NetID@grace1 ~]$ mla

To search for a specific piece of software installed as a module on Grace, use the mla utility with a keyword:

[NetID@grace1 ~]$ mla keyword

To search for particular software by keyword, use:

[NetID@grace1 ~]$ module spider keyword

To see how to load a module, use the full module name:

[NetID@grace1 ~]$ module spider Perl/5.32.0

You will see a message like the following:

You will need to load all module(s) on any one of the lines below before the "Perl/5.32.0" module is available to load.

      GCCcore/10.2.0

Load the base dependency module(s) first, then the full module name:

[NetID@grace1 ~]$ module load GCCcore/10.2.0  Perl/5.32.0

To list all currently loaded modules, use:

[NetID@grace1 ~]$ module list

To see what other modules can be loaded with the base dependency module (for example, when GCCcore/10.2.0 is loaded), use:

[NetID@grace1 ~]$ module avail

To remove all currently loaded modules, use:

[NetID@grace1 ~]$ module purge

If you need new software or an update, please contact us with your request.

There are restrictions on what software we can install, and there is often a queue of requested software installations.

Please account for delays in your installation request timeline.

Running Your Program / Preparing a Job File

In order to properly run a program on Grace, you will need to create a job file and submit a job to the batch system. The batch system is a load distribution implementation that ensures convenient and fair use of a shared resource. Submitting jobs to a batch system allows a user to reserve specific resources with minimal interference to other users. All users are required to submit resource-intensive processing to the compute nodes through the batch system - attempting to circumvent the batch system is not allowed.

On Grace, Slurm is the batch system that provides job management. More information on Slurm can be found on the Grace Batch page.


The simple example job file below requests 1 core on 1 node with 2.5GB of RAM for 1.5 hours. Note that typical nodes on Grace have 48 cores with 384 GB of usable memory; ensure that your job requirements will fit within these limits. Any modules that need to be loaded or executable commands will replace the "#First Executable Line" in this example.

#!/bin/bash
##ENVIRONMENT SETTINGS; CHANGE WITH CAUTION
#SBATCH --export=NONE        #Do not propagate environment
#SBATCH --get-user-env=L     #Replicate login environment
  
##NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=JobExample1     #Set the job name to "JobExample1"
#SBATCH --time=01:30:00            #Set the wall clock limit to 1hr and 30min
#SBATCH --ntasks=1                 #Request 1 task
#SBATCH --ntasks-per-node=1        #Request 1 task/core per node
#SBATCH --mem=2560M                #Request 2560MB (2.5GB) per node
#SBATCH --output=Example1Out.%j    #Send stdout/err to "Example1Out.[jobID]"

#First Executable Line

Note: If your job file has been written on an older Mac or DOS workstation, you will need to use "dos2unix" to remove certain characters that interfere with parsing the script.

[NetID@grace1 ~]$ dos2unix MyJob.slurm

More information on job options can be found in the Building Job Files section of the Grace Batch page.

More information on dos2unix can be found on the dos2unix section of the HPRC Available Software page.

Submitting and Monitoring Jobs

Once you have your job file ready, it is time to submit your job. You can submit your job to Slurm with the following command:

[NetID@grace1 ~]$ sbatch MyJob.slurm
Submitted batch job 3606

After the job has been submitted, you are able to monitor it with several methods. To see the status of all of your jobs, use the following command:

[NetID@grace1 ~]$ squeue -u NetID
JOBID       NAME                USER                    PARTITION   NODES CPUS STATE       TIME        TIME_LEFT   START_TIME           REASON      NODELIST            
3606        myjob2              NetID                   short       1     3    RUNNING     0:30        00:10:30    2016-11-27T23:44:12  None        tnxt-[0340]  

To see the status of one job, use the following command, where XXXX is the JobID:

[NetID@grace1 ~]$ squeue --job XXXX
JOBID       NAME                USER                    PARTITION   NODES CPUS STATE       TIME        TIME_LEFT   START_TIME           REASON      NODELIST            
XXXX        myjob2              NetID                   short       1     3    RUNNING     0:30        00:10:30    2016-11-27T23:44:12  None        tnxt-[0340]  

To cancel a job, use the following command, where XXXX is the JobID:

[NetID@grace1 ~]$ scancel XXXX

More information on Job Submission and Job Monitoring for Slurm jobs can be found on the Grace Batch System page.

tamubatch

tamubatch is an automatic batch job script that submits jobs for the user without the need to write a batch script on the clusters. The user just needs to provide the executable commands in a text file, and tamubatch will automatically submit the job to the cluster. There are flags the user may specify that allow control over the parameters of the submitted job.

tamubatch is still in beta and has not been fully developed. Although there are still bugs and testing issues that are currently being worked on, tamubatch can already submit jobs to both the Terra and Grace clusters if given a file of executable commands.
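As a rough sketch (assuming the command file is passed directly as the argument; the file and script names are placeholders, and the available flags are documented on the tamubatch page), a run might look like:

[NetID@grace1 ~]$ cat commands.txt
module load GCCcore/10.2.0 Perl/5.32.0
perl hello.pl
[NetID@grace1 ~]$ tamubatch commands.txt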

For more information, visit the tamubatch page at https://hprc.tamu.edu/wiki/SW:tamubatch.

Graphic User Interfaces (Visualization)

The use of GUIs on Grace is a more complicated process than running non-interactive jobs or doing resource-light interactive processing.

You have two options for using GUIs on Grace.

The first option is to use the Open On Demand Portal (https://portal.hprc.tamu.edu), which is a web interface to our clusters. Users must be connected to the campus network either directly or via VPN to access the portal. More information can be found on the SW:Portal wiki page (https://hprc.tamu.edu/wiki/SW:Portal) or on our YouTube channel.

The second option is to run on the login node. When doing this, you must observe the fair-use policy of login node usage. Users commonly violate these policies by accident, resulting in terminated processes, confusion, and warnings from our admins.

Deep Learning with TensorFlow and PyTorch

Installing Python venv

  # load all the required modules
  ml purge
          
  # CUDA modules are needed for TensorFlow
  ml GCCcore/9.3.0 GCC/9.3.0 Python/3.8.2 CUDAcore/11.0.2 CUDA/11.0.2 cuDNN/8.0.5.39-CUDA-11.0.2
          
  # As PyTorch comes with CUDA libraries, we don't need to load CUDA modules.
  # the following two modules are sufficient for PyTorch
  # ml GCCcore/10.2.0 Python/3.8.6
          
  # you can save your module list with (dl is an arbitrary name)
  module save dl
       
  # the next time you log in, you can simply run
  module restore dl
     
  # create a virtual environment (the name dlvenv is arbitrary)
  cd $SCRATCH
  python -m venv dlvenv
  source dlvenv/bin/activate

Installing Python packages

  # First upgrade pip to avoid warning messages
  pip install -U pip
     
  # You can watch GPU usage on another terminal with
  watch -n 0.5 nvidia-smi

TensorFlow

  # install TensorFlow and other packages as needed.
  pip install tensorflow
  
  # Try things out (note that the login nodes grace4 and grace5 don't have GPUs)
  python -c "import tensorflow as tf; print(tf.test.gpu_device_name())"

PyTorch

  pip install torch torchvision
  
  # Try things out (note that the login nodes grace4 and grace5 don't have GPUs)
  python -c "import torch; print(torch.device('cuda:0' if torch.cuda.is_available() else 'cpu'))"

All other Python packages can be installed with pip install in the same way.
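For example (these package names are only illustrative):

  pip install numpy pandas scikit-learn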

Sample Job Script - dl.slurm

  #!/bin/bash
  ##ENVIRONMENT SETTINGS; CHANGE WITH CAUTION
  #SBATCH --export=NONE            	#Do not propagate environment
  #SBATCH --get-user-env=L         	#Replicate login environment
        
  ##NECESSARY JOB SPECIFICATIONS
  #SBATCH --job-name=JobExample4   	#Set the job name to "JobExample4"
  #SBATCH --time=00:30:00          	#Set the wall clock limit to 30 minutes
  #SBATCH --ntasks=1               	#Request 1 task
  #SBATCH --mem=2560M              	#Request 2560MB (2.5GB) per task
  #SBATCH --output=Example4Out.%j  	#Send stdout/err to "Example4Out.[jobID]"
  #SBATCH --gres=gpu:1             	#Request 1 GPU per node (can be 1 or 2)
  #SBATCH --partition=gpu          	#Request the GPU partition/queue
      
  # modules needed for running DL jobs. Module restore will also work
  #module restore dl 
  ml GCCcore/9.3.0 GCC/9.3.0 Python/3.8.2 CUDAcore/11.0.2 CUDA/11.0.2 cuDNN/8.0.5.39-CUDA-11.0.2
      
  # Python venv
  source $SCRATCH/dlvenv/bin/activate
   
  # scripts or executables
  cd $SCRATCH/mywonderfulproject
  python TuringTest.py

Submit your Slurm job with sbatch:

  sbatch dl.slurm