TensorFlow
Description
TensorFlow is an open source software library for numerical computation using data flow graphs.
- Homepage: https://www.tensorflow.org/
Access
TensorFlow is open to all HPRC users.
There are restrictions as to which version (GPU/CPU) of TensorFlow works on each cluster. Please note these restrictions in the following sections and plan your jobs accordingly.
TensorFlow Modules
TAMU HPRC currently supports the use of TensorFlow through the module system.
module spider TensorFlow
module load GCC/11.3.0 OpenMPI/4.1.4 TensorFlow/2.13.0-CUDA-11.8.0
You can learn more about the module system on our SW:Modules page.
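After loading the module, a quick sanity check (a sketch; the exact output depends on the node and the module version) confirms that TensorFlow imports correctly and reports whether any GPUs are visible. On a CPU-only login node the GPU list will normally be empty.
[NetID@cluster ~]$ python -c "import tensorflow as tf; print(tf.__version__); print(tf.config.list_physical_devices('GPU'))"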
Example TensorFlow Script
As with most software on the clusters, TensorFlow should be used via the submission of a job file. Scripts using TensorFlow are written in Python, so TensorFlow code should not be written directly inside a job file or entered in the shell line by line. Instead, create a separate file for the Python/TensorFlow script, which the job file can then execute.
Below is an example script (entered in the text editor of your choice):
import tensorflow as tf

# TensorFlow 2.x runs eagerly; no Session or variable initializer is needed.
x = tf.constant(35, name='x')
y = tf.Variable(x + 5, name='y')

print(y.numpy())
It is recommended, though not required, to save this script with a .py file extension.
Once saved, the script can be tested on a login node by entering:
[NetID@cluster ~]$ python testscript.py
NOTE: Make sure to run this command from the same directory that the script is saved in.
NOTE: While acceptable to test programs on the login node, please do not run extended or intense computation on these shared resources. Use a batch job and the compute nodes for heavy processing.
NOTE: Multi-core TensorFlow scripts potentially use ALL available cores by default. This can inadvertently crash the login node. Multi-core TensorFlow scripts must be tested within a batch job.
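If you do need a brief multi-core test, one way to keep a script from claiming every core is to cap TensorFlow's thread pools before any other TensorFlow call. The sketch below uses the TensorFlow 2.x threading API; the thread counts are only illustrative.
import tensorflow as tf

# Cap TensorFlow's thread pools; these must be set before any other TF operation.
tf.config.threading.set_intra_op_parallelism_threads(2)
tf.config.threading.set_inter_op_parallelism_threads(2)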
Installing Additional Python/TensorFlow Packages
While multiple versions of Python, Anaconda, and TensorFlow are available on our clusters, you may at times need specialized libraries or packages that are not part of our pre-installed software.
Software installation by HPRC staff is usually reserved for widely used or complex packages, and software installation requests have a 10-15 business day (2-3 week) turnaround time.
You are encouraged to save time by installing your own Python/TensorFlow packages via the process described in the following sections.
General Installation Notes
User installation of Python packages is straightforward, with a few caveats. The following notes cover most of the issues users encounter.
- Disk/File Quota: Storage within $HOME is limited. Packages should be installed in the user's $SCRATCH directory (see the sketch after this list). See Grace File Systems for more info.
- Internet Connection: Only login nodes have access to the Internet. Attempting to use pip or other Internet access from a batch job (on a compute node) will fail.
- Mixing Tools: myPython, Anaconda, and myAnaconda must not be mixed. Attempting to combine them in order to add a package/library will fail.
- Version Compatibility: Some packages require a specific or older version of TensorFlow. Please verify package compatibility against the versions available on the cluster.
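For example, to keep both the virtual environment and pip's download cache out of $HOME, you can work from $SCRATCH and redirect pip's cache before installing. This is a sketch assuming a bash shell; the cache location shown is arbitrary.
cd $SCRATCH
export PIP_CACHE_DIR=$SCRATCH/.pip_cache   # keep pip's download cache out of $HOME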
Adding TensorFlow Packages: Grace
For TensorFlow packages that require installation:
ml purge
ml GCC/11.3.0 OpenMPI/4.1.4 Python/3.10.4
python -m venv tdenv
source tdenv/bin/activate
pip install [link to package file]
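In later sessions or inside a batch job, the same modules must be loaded and the environment re-activated before the installed packages are available. This sketch assumes the tdenv directory was created in the current working directory (for example, under $SCRATCH):
ml purge
ml GCC/11.3.0 OpenMPI/4.1.4 Python/3.10.4
source tdenv/bin/activate
python testscript.py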
Usage on the Login Nodes
Please limit interactive processing to short, non-intensive usage. Use non-interactive batch jobs for resource-intensive and/or multiple-core processing. Users are requested to be responsible and courteous to other users when using software on the login nodes.
The most important processing limits here are:
- ONE HOUR of PROCESSING TIME per login session.
- EIGHT CORES per login session, on the same node or (cumulatively) across all login nodes.
Anyone found violating the processing limits will have their processes killed without warning. Repeated violation of these limits will result in account suspension.
Note: Your login session will disconnect after one hour of inactivity.
Usage on the Compute Nodes
Non-interactive batch jobs on the compute nodes allow for resource-demanding processing. Non-interactive jobs have higher limits on the number of cores, amount of memory, and runtime length.
For instructions on how to create and submit a batch job, please see the appropriate wiki page for each respective cluster:
- Grace: About Grace Batch Processing
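As a starting point, a minimal Grace job file might look like the sketch below. The resource requests (time, cores, memory, GPU count) are placeholders to adjust for your workload, the GPU request line should be checked against the Grace batch processing page, and the module versions match the example earlier on this page.
#!/bin/bash
#SBATCH --job-name=tf_example        # job name
#SBATCH --time=01:00:00              # requested wall time
#SBATCH --ntasks=1                   # single task
#SBATCH --cpus-per-task=4            # CPU cores for the task
#SBATCH --mem=16G                    # requested memory
#SBATCH --gres=gpu:1                 # request one GPU (omit for CPU-only runs)
#SBATCH --output=tf_example.%j.out   # combined stdout/stderr file

ml purge
ml GCC/11.3.0 OpenMPI/4.1.4 TensorFlow/2.13.0-CUDA-11.8.0

python testscript.py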
Usage on the VNC Nodes
VNC jobs allow for the use of a graphical user interface (GUI) without disrupting other users.
Short Course
HPRC hosts a TensorFlow short course once per semester. Information can be found HERE.
The slide deck associated with this TensorFlow short course can be found HERE.
HPRC Publications
This publication includes benchmarks of TensorFlow on HPRC clusters.
- Abhinand Nasari, Hieu Le, Richard Lawrence, Zhenhua He, Xin Yang, Mario Krell, Alex Tsyplikhin, Mahidhar Tatineni, Tim Cockerill, Lisa Perez, Dhruva Chakravorty, and Honggao Liu. 2022. Benchmarking the Performance of Accelerators on National Cyberinfrastructure Resources for Artificial Intelligence / Machine Learning Workloads. In Practice and Experience in Advanced Research Computing 2022: Revolutionary: Computing, Connections, You (PEARC '22). Association for Computing Machinery, New York, NY, USA, Article 19, 1–9. https://doi.org/10.1145/3491418.3530772