TensorFlow
Description
TensorFlow is an open source software library for numerical computation using data flow graphs.
- Homepage: https://www.tensorflow.org/
Access
TensorFlow is open to all HPRC users.
There are restrictions as to which version (GPU/CPU) of TensorFlow works on each cluster. Please note these restrictions in the following sections and plan your jobs accordingly.
TensorFlow Modules
TAMU HPRC currently supports the use of TensorFlow through the module system.
module spider TensorFlow
module load GCC/11.3.0 OpenMPI/4.1.4 TensorFlow/2.13.0-CUDA-11.8.0
You can learn more about the module system on our SW:Modules page.
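After loading the module, a quick sanity check (a sketch; the exact output depends on the node and the module version) confirms that TensorFlow imports correctly and reports whether any GPUs are visible. On a CPU-only login node the GPU list will normally be empty.
[NetID@cluster ~]$ python -c "import tensorflow as tf; print(tf.__version__); print(tf.config.list_physical_devices('GPU'))"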
Example TensorFlow Script
As with most software on the clusters, TensorFlow should be used via the submission of a job file. Scripts using TensorFlow are written in Python, so TensorFlow code should not be written directly inside a job file or entered in the shell line by line. Instead, create a separate file for the Python/TensorFlow script, which the job file can then execute.
Below is an example script (entered in the text editor of your choice):
import tensorflow as tf

# TensorFlow 2.x runs eagerly; no Session or variable initializer is needed.
x = tf.constant(35, name='x')
y = tf.Variable(x + 5, name='y')

print(y.numpy())
It is recommended, though not required, to save this script with a .py file extension.
Once saved, the script can be tested on a login node by entering:
[NetID@cluster ~]$ python testscript.py
NOTE: Make sure to run this command from the same directory that the script is saved in.
NOTE: While acceptable to test programs on the login node, please do not run extended or intense computation on these shared resources. Use a batch job and the compute nodes for heavy processing.
NOTE: Multi-core TensorFlow scripts potentially use ALL available cores by default. This can inadvertently crash the login node. Multi-core TensorFlow scripts must be tested within a batch job.
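If you do need a brief multi-core test, one way to keep a script from claiming every core is to cap TensorFlow's thread pools before any other TensorFlow call. The sketch below uses the TensorFlow 2.x threading API; the thread counts are only illustrative.
import tensorflow as tf

# Cap TensorFlow's thread pools; these must be set before any other TF operation.
tf.config.threading.set_intra_op_parallelism_threads(2)
tf.config.threading.set_inter_op_parallelism_threads(2)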
Installing Additional Python/TensorFlow Packages
While multiple versions of Python, Anaconda, and TensorFlow are available on our clusters, you may at times need specialized libraries or packages that are not part of our pre-installed software.
Software installation by HPRC staff is usually reserved for widely used or complex packages, and software installation requests have a 10-15 business day (2-3 week) turnaround time.
You are encouraged to save time by installing your own Python/TensorFlow packages via the process described in the following sections.
General Installation Notes
User installation of Python packages is straightforward, with a few caveats. The following notes cover most of the issues users encounter.
- Disk/File Quota: Storage within $HOME is limited. Packages should be installed in the user's $SCRATCH directory (see the sketch after this list). See Grace File Systems for more info.
- Internet Connection: Only login nodes have access to the Internet. Attempting to use pip or other Internet access from a batch job (on a compute node) will fail.
- Mixing Tools: myPython, Anaconda, and myAnaconda must not be mixed. Attempting to combine them in order to add a package/library will fail.
- Version Compatibility: Some packages require a specific or older version of TensorFlow. Please verify package compatibility against the versions available on the cluster.
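For example, to keep both the virtual environment and pip's download cache out of $HOME, you can work from $SCRATCH and redirect pip's cache before installing. This is a sketch assuming a bash shell; the cache location shown is arbitrary.
cd $SCRATCH
export PIP_CACHE_DIR=$SCRATCH/.pip_cache   # keep pip's download cache out of $HOME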
Adding TensorFlow Packages: Grace
For TensorFlow packages that require installation:
ml purge
ml GCC/11.3.0 OpenMPI/4.1.4 Python/3.10.4
python -m venv tdenv
source tdenv/bin/activate
pip install [link to package file]
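In later sessions or inside a batch job, the same modules must be loaded and the environment re-activated before the installed packages are available. This sketch assumes the tdenv directory was created in the current working directory (for example, under $SCRATCH):
ml purge
ml GCC/11.3.0 OpenMPI/4.1.4 Python/3.10.4
source tdenv/bin/activate
python testscript.py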
Usage on the Login Nodes
Please limit interactive processing to short, non-intensive usage. Use non-interactive batch jobs for resource-intensive and/or multiple-core processing. Users are requested to be responsible and courteous to other users when using software on the login nodes.
The most important processing limits here are:
- ONE HOUR of PROCESSING TIME per login session.
- EIGHT CORES per login session, on the same node or (cumulatively) across all login nodes.
Anyone found violating the processing limits will have their processes killed without warning. Repeated violation of these limits will result in account suspension.
Note: Your login session will disconnect after one hour of inactivity.
Usage on the Compute Nodes
Non-interactive batch jobs on the compute nodes allow for resource-demanding processing. Non-interactive jobs have higher limits on the number of cores, amount of memory, and runtime length.
For instructions on how to create and submit a batch job, please see the appropriate wiki page for each respective cluster:
- Grace: About Grace Batch Processing
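As a starting point, a minimal Grace job file might look like the sketch below. The resource requests (time, cores, memory, GPU count) are placeholders to adjust for your workload, the GPU request line should be checked against the Grace batch processing page, and the module versions match the example earlier on this page.
#!/bin/bash
#SBATCH --job-name=tf_example        # job name
#SBATCH --time=01:00:00              # requested wall time
#SBATCH --ntasks=1                   # single task
#SBATCH --cpus-per-task=4            # CPU cores for the task
#SBATCH --mem=16G                    # requested memory
#SBATCH --gres=gpu:1                 # request one GPU (omit for CPU-only runs)
#SBATCH --output=tf_example.%j.out   # combined stdout/stderr file

ml purge
ml GCC/11.3.0 OpenMPI/4.1.4 TensorFlow/2.13.0-CUDA-11.8.0

python testscript.py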
Usage on the VNC Nodes
VNC jobs allow for the use of a graphical user interface (GUI) without disrupting other users.
Short Course
HPRC hosts a TensorFlow short course once per semester. Information can be found HERE.
The slide deck associated with this TensorFlow short course can be found HERE.
HPRC Publications
This publication includes benchmarks of TensorFlow on HPRC clusters.
- Abhinand Nasari, Hieu Le, Richard Lawrence, Zhenhua He, Xin Yang, Mario Krell, Alex Tsyplikhin, Mahidhar Tatineni, Tim Cockerill, Lisa Perez, Dhruva Chakravorty, and Honggao Liu. 2022. Benchmarking the Performance of Accelerators on National Cyberinfrastructure Resources for Artificial Intelligence / Machine Learning Workloads. In Practice and Experience in Advanced Research Computing 2022: Revolutionary: Computing, Connections, You (PEARC '22). Association for Computing Machinery, New York, NY, USA, Article 19, 1–9. https://doi.org/10.1145/3491418.3530772