Research Data Management
Texas A&M High Performance Research Computing provides research data management solutions and technology for research projects on the ACES computing cluster. We offer training and consulting on data management plans, data preparation, analysis, storage, and computing resources in domains including AI/ML, molecular dynamics, and bioinformatics.
We maintain a centrally accessible repository of widely-used datasets on ACES, located at /scratch/data, available to all researchers. These datasets cover a broad range of domains including computer vision, speech, graph, and multimodal data. For instance, the collection includes the ILSVRC2012 dataset (ImageNet Large Scale Visual Recognition Challenge 2012):
- For PyTorch models: /scratch/data/pytorch-computer-vision-datasets
- For TensorFlow models: /scratch/data/tensorflow-computer-vision-datasets
If you would like to request the addition of another widely-used community dataset to this central location, please contact us at help@hprc.tamu.edu. We will review your request and respond accordingly.
Dataset Modules
The datasets are made available as modules for researchers to easily access and use. To view all available dataset modules, run:
module avail datasets
To search for a specific dataset, view its description, citation information, and environment variable for path access, use:
module spider <dataset>
Once identified, you can load the dataset into your environment with:
module load <dataset>
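Because each dataset module exposes its location through an environment variable, a script can read that variable instead of hard-coding a path. The following is a minimal sketch, not an HPRC-provided utility. The variable name DATASET_ROOT_EXAMPLE and the helper resolve_dataset_root are hypothetical; substitute the variable name that module spider reports for your dataset.

```python
import os
from pathlib import Path


def resolve_dataset_root(env_var: str) -> Path:
    """Return the dataset directory named by an environment variable.

    Raises RuntimeError if the variable is unset (module probably not loaded)
    and FileNotFoundError if it does not point at an existing directory.
    """
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set; load the dataset module first.")
    root = Path(value)
    if not root.is_dir():
        raise FileNotFoundError(f"{env_var} points to a missing directory: {root}")
    return root


if __name__ == "__main__":
    # DATASET_ROOT_EXAMPLE is a placeholder, not a real ACES variable name.
    try:
        root = resolve_dataset_root("DATASET_ROOT_EXAMPLE")
    except (RuntimeError, FileNotFoundError) as err:
        print(f"Dataset unavailable: {err}")
    else:
        # List a few top-level entries to confirm the expected layout.
        for entry in sorted(root.iterdir())[:5]:
            print(entry.name)
```

Failing early with a clear message makes a forgotten module load easier to diagnose than a missing-file error deep inside a training script.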
Data Storage
The standard quota for the $SCRATCH directory on HPRC clusters is 1TB. Quotas can be increased with justification and management approval. For help with your quota needs, email help@hprc.tamu.edu or submit a request on the ACES Dashboard.
If you have an NSF ACCESS allocation on ACES, you will be provided a 5TB project directory on it.
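Before requesting a quota increase, it can help to estimate how much of the quota a directory tree already uses. The sketch below sums file sizes with the Python standard library and compares the total to 1 TB. It is an approximation under stated assumptions: it counts apparent file sizes, treats 1 TB as 10^12 bytes, and may be slow on very large trees. The cluster's own quota reporting remains the authoritative source, and the function names here are illustrative.

```python
import os
from pathlib import Path

TB = 1000 ** 4  # assumption: decimal terabyte; the quota system may count differently


def directory_usage_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # Files can vanish or be unreadable during the walk; skip them.
                continue
    return total


def quota_fraction(used_bytes: int, quota_bytes: int = TB) -> float:
    """Return the used share of the quota as a fraction (0.5 means 50%)."""
    return used_bytes / quota_bytes


if __name__ == "__main__":
    scratch = os.environ.get("SCRATCH")
    if scratch:
        used = directory_usage_bytes(Path(scratch))
        print(f"{scratch}: {used / TB:.3f} TB used, {quota_fraction(used):.1%} of 1 TB")
    else:
        print("SCRATCH is not set in this session.")
```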
Data Transfer
Globus Connect is a high-performance file transfer platform that enables seamless transfer of large data between systems or endpoints. Users can schedule transfers through a web-based interface and receive notifications upon completion. Endpoints may include systems with Globus installed, such as aces-dtn, a user’s personal desktop or Microsoft OneDrive. More details are here
rclone is a tool for syncing files from HPRC systems to remote storage sites like Google Drive. See how to transfer files here
AI/ML
Hugging Face
Hugging Face is an open-source platform that provides tools, models, and datasets for AI/ML. ACES supports working with Hugging Face through the installed huggingface_hub, transformers, and git-lfs modules; git-lfs manages large files (such as model weights) efficiently.
Example to Download a Model
Load huggingface_hub and transformers
module load huggingface_hub-<version> Transformers/<version>
Authenticate with Hugging Face (optional, for Gated Models)
huggingface-cli login
Paste your Hugging Face token (found at HuggingFace Tokens).
Set a cache location for the models (preferably in $SCRATCH)
export HF_HOME=/path/to/your/dir/for/.cache/huggingface
Download the Model
You can use a Python script to download the model. Using Llama-2-7b-chat-hf as an example:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Llama-2-7b-chat-hf" # replace with your desired model
# Downloads and caches the model in the cache location
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
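The cache location can also be set from within Python, provided this happens before transformers or huggingface_hub is imported, since those libraries read HF_HOME at import time. The following is a minimal sketch under that assumption. The helper names configure_hf_cache and download_model and the directory name hf_cache are illustrative, not part of any ACES tooling. It uses snapshot_download from huggingface_hub, which fetches the repository files into the cache without loading the model into memory.

```python
import os
from pathlib import Path


def configure_hf_cache(base_dir: str) -> Path:
    """Point HF_HOME at <base_dir>/hf_cache and create the directory."""
    cache_dir = Path(base_dir) / "hf_cache"  # hypothetical directory name
    cache_dir.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HOME"] = str(cache_dir)
    return cache_dir


def download_model(repo_id: str) -> str:
    """Download all files of a model repository and return the local path."""
    # Imported here so that HF_HOME is already set when the library reads it.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)


# Example usage (commented out: it needs network access, the huggingface_hub
# module, and approved access plus a login for gated models):
# cache = configure_hf_cache(os.environ["SCRATCH"])
# local_path = download_model("meta-llama/Llama-2-7b-chat-hf")
```

Downloading without instantiating the model is useful on login nodes, where loading several gigabytes of weights into memory is usually unnecessary.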
You can also use git clone and Git LFS (Large File Storage) to download models from Hugging Face repositories.
module load git-lfs/<version>
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
Download a Dataset
Create and activate your conda/virtual environment:
module load Anaconda/<version>
# Include python so the environment gets its own pip
conda create -n your-env python
source activate your-env
pip install datasets
Download the dataset in a Python script:
from datasets import load_dataset
# Replace 'squad' with your desired dataset
dataset = load_dataset("squad")
ACES OOD Apps for Hugging Face Model & Dataset Downloads
We also provide an interactive app on the ACES Open OnDemand portal that lets users download Hugging Face models and datasets directly to the ACES computing cluster.
Roboflow Universe
Roboflow Universe is a large repository of open-source computer vision datasets and pretrained models. You can download a dataset directly and then transfer it to ACES, for example with Globus.
Training and Consulting
- Introductory and Intermediate Python for Data Science
- Introduction to Data Science in R
- Data Science Meets Geoscience
- AI TechLab in Jupyter Notebooks
- AI TechLab on Graphcore IPUs
- AI TechLab on Intel PVC GPUs
- BYOC sessions
- BYOG sessions
Training and Consulting Team
- Dr. Zhenhua He, Associate Research Scientist, Computer Science and Geoscience, happidence1@tamu.edu
- Dr. Wesley Brashear, Associate Research Scientist, Genetics, wbrashear@tamu.edu
- Richard Lawrence, User Support Specialist, Physics, rarensu@tamu.edu
- Dr. Joshua Winchell, Assistant Research Scientist, Physics, jwinchell@tamu.edu
- Dr. James Mao, Assistant Research Scientist, Chemistry, jwinchell@tamu.edu