AlphaFold
GPU Support
- Currently AlphaFold only supports NVIDIA GPUs; Intel PVC GPUs are not supported.
- Currently AlphaFold only supports running on a single GPU.
- Use ParaFold to run AlphaFold on the HPRC clusters, especially ACES, to avoid idle GPU usage.
GCATemplates
Description
AlphaFold is an AI system that predicts a protein’s 3D structure from its amino acid sequence.
AlphaFold 3 is not available on the HPRC clusters since it is not open source.
Scientists can access the majority of its capabilities, for free, through the AlphaFold Server.
AlphaFold homepage
AlphaFold 2.3.2+ is available on Grace as a Singularity container created from tacc/alphafold:
/sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.2.sif
The AlphaFold 2.3.2 database and newer versions are found in the following directory:
/scratch/data/bio/alphafold/
If using the Singularity image, you can run AlphaFold with the following command either from within the VNC portal app or from within your job script.
Be sure to select a GPU when launching the VNC app.
singularity exec /sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.2.sif python /app/alphafold/run_alphafold.py --helpfull
See the Biocontainers page for additional details on how to use .sif image files.
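To see which AlphaFold images are currently installed, you can list the container directory (the versions present may change over time):
ls /sw/hprc/sw/bio/containers/alphafold/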
ParaFold
ParaFold is the preferred approach to run AlphaFold on the HPRC clusters, especially ACES, since it does not hold the GPU idle for hours during the multiple sequence alignment step.
ParaFold is a modified version of DeepMind's AlphaFold 2 designed for high-throughput protein structure prediction.
Example job script
This example job script runs the two parts of ParaFold. The first job runs the multiple sequence alignments on a CPU-only node and submits a second, dependent job; once the first job completes successfully, the second job runs the structure prediction step on a GPU node.
#!/bin/bash
#SBATCH --job-name=parafold-cpu # job name
#SBATCH --time=1-00:00:00 # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1 # tasks (commands) per compute node
#SBATCH --cpus-per-task=48 # CPUs (threads) per command
#SBATCH --mem=488G # total memory per node
#SBATCH --output=stdout.%x.%j # save stdout to file
#SBATCH --error=stderr.%x.%j # save stderr to file
module purge
module load GCC/11.3.0 OpenMPI/4.1.4 ParaFold/2.0-CUDA-11.8.0
ALPHAFOLD_DATA_DIR=/scratch/data/bio/alphafold/2.3.2
protein_fasta='/scratch/data/bio/alphafold/example_data/1L2Y.fasta'
pf_output_dir=parafold_output_dir
model_preset=multimer # monomer, monomer_casp14, monomer_ptm, multimer
max_template_date=2024-1-1
jobstats &
# First, run CPU-only steps to get multiple sequence alignments:
run_alphafold.sh -d $ALPHAFOLD_DATA_DIR -o $pf_output_dir -i $protein_fasta -p $model_preset -t $max_template_date -f
# Second, run GPU steps as a separate job after the first part completes successfully:
sbatch --job-name=parafold-gpu --time=2-00:00:00 --ntasks-per-node=1 --cpus-per-task=24 --mem=122G --gres=gpu:h100:1 --partition=gpu --output=stdout.%x.%j --error=stderr.%x.%j --dependency=afterok:$SLURM_JOBID<<EOF
#!/bin/bash
module purge
module load GCC/11.3.0 OpenMPI/4.1.4 ParaFold/2.0-CUDA-11.8.0
jobstats -i 1 &
echo "run_alphafold.sh -g -u 0 -d $ALPHAFOLD_DATA_DIR -o $pf_output_dir -p $model_preset -i $protein_fasta -t $max_template_date"
run_alphafold.sh -g -u 0 -d $ALPHAFOLD_DATA_DIR -o $pf_output_dir -p $model_preset -i $protein_fasta -t $max_template_date
# graph pLDDT and PAE from the .pkl files; the output subdirectory is named after the fasta file (1L2Y)
run_AlphaPickle.py -od $pf_output_dir/1L2Y
jobstats
EOF
jobstats
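Save the workflow above as a job script (for example parafold_msa.sh, a hypothetical file name) and submit it once; the GPU job is submitted automatically by the CPU job and waits on its successful completion. You can watch both jobs with squeue:
sbatch parafold_msa.sh
squeue -u $USER    # the parafold-gpu job shows a Dependency reason until the CPU job finishes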
Example Grace Job Scripts
ParaFold is available on Grace and is the preferred approach to running AlphaFold because it avoids idle GPU usage.
The following is an example of using the Singularity image to run AlphaFold.
Example 1L2Y amino acid sequence:
NLYIQWLKDGGPSSGRPPPS
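For example, you could save the sequence as 1L2Y.fasta in your job submission directory (the file name is only a convention here and must match the --fasta_paths value in the job script below):
cat > 1L2Y.fasta << 'EOF'
>1L2Y
NLYIQWLKDGGPSSGRPPPS
EOF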
Monomer
#!/bin/bash
#SBATCH --job-name=alphafold # Grace job name
#SBATCH --time=2-00:00:00 # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1 # tasks (commands) per compute node
#SBATCH --cpus-per-task=24 # CPUs (threads) per command
#SBATCH --mem=180G # total memory per node
#SBATCH --gres=gpu:a100:1 # request 1 A100 GPU
#SBATCH --output=stdout.%x.%j # save stdout to file
#SBATCH --error=stderr.%x.%j # save stderr to file
module purge
# the next two lines can be added if your job runs out of GPU memory
export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0
DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.3.2
# run gpustats in the background (&) to monitor gpu usage in order to create a graph later
gpustats &
singularity exec --nv /sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.2.sif python /app/alphafold/run_alphafold.py \
--data_dir=$DOWNLOAD_DIR --use_gpu_relax \
--uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta \
--mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2022_05.fa \
--bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path=$DOWNLOAD_DIR/uniref30/UniRef30_2023_02 \
--pdb70_database_path=$DOWNLOAD_DIR/pdb70/pdb70 \
--template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat \
--model_preset=monomer \
--max_template_date=2024-1-1 \
--db_preset=full_dbs \
--output_dir=output_dir \
--fasta_paths=1L2Y.fasta
# run gpustats to create a graph of gpu usage for this job
gpustats
Multimer
#!/bin/bash
#SBATCH --job-name=alphafold # Grace job name
#SBATCH --time=2-00:00:00 # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1 # tasks (commands) per compute node
#SBATCH --cpus-per-task=24 # CPUs (threads) per command
#SBATCH --mem=180G # total memory per node
#SBATCH --gres=gpu:a100:1 # request 1 A100 GPU
#SBATCH --output=stdout.%x.%j # save stdout to file
#SBATCH --error=stderr.%x.%j # save stderr to file
module purge
export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0
DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.3.2
# run gpustats in the background (&) to monitor gpu usage in order to create a graph later
gpustats &
singularity exec --nv /sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.2.sif python /app/alphafold/run_alphafold.py \
--use_gpu_relax \
--data_dir=$DOWNLOAD_DIR \
--uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta \
--mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2022_05.fa \
--bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path=$DOWNLOAD_DIR/uniref30/UniRef30_2023_02 \
--pdb_seqres_database_path=$DOWNLOAD_DIR/pdb_seqres/pdb_seqres.txt \
--uniprot_database_path=$DOWNLOAD_DIR/uniprot/uniprot.fasta \
--template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat \
--model_preset=multimer \
--max_template_date=2024-1-1 \
--db_preset=full_dbs \
--output_dir=out_alphafold \
--fasta_paths=/scratch/data/bio/alphafold/example_data/T1083_T1084_multimer.fasta
# run gpustats to create a graph of gpu usage for this job
gpustats
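The multimer input is a single fasta file containing one record per chain. A minimal sketch follows (the chain names and sequences below are hypothetical placeholders; the provided example_data/T1083_T1084_multimer.fasta can be used instead):
cat > my_dimer.fasta << 'EOF'
>chainA_hypothetical
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>chainB_hypothetical
MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQWERPSG
EOF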
Multiple Monomers
If you have multiple protein sequences from the same organism, running multiple monomers in one job script will significantly speed up processing.
The following job script uses the AlphaFold/2.3.1 module instead of the Singularity image.
The AlphaFold module is usually one version behind the Singularity image.
#!/bin/bash
#SBATCH --job-name=alphafold_multi_monomer # Grace job name
#SBATCH --time=2-00:00:00 # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1 # tasks (commands) per compute node
#SBATCH --cpus-per-task=24 # CPUs (threads) per command
#SBATCH --mem=180G # total memory per node
#SBATCH --gres=gpu:a100:1 # request one a100 GPU
#SBATCH --output=stdout.%x.%j # save stdout to file
#SBATCH --error=stderr.%x.%j # save stderr to file
module purge
module load GCC/11.3.0 OpenMPI/4.1.4 AlphaFold/2.3.1-CUDA-11.7.0
module load AlphaPickle/1.4.1
DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.3.2
# start a process to monitor GPU usage
gpustats &
run_alphafold.py \
--use_gpu_relax \
--data_dir=$DOWNLOAD_DIR \
--uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta \
--mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2022_05.fa \
--bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path=$DOWNLOAD_DIR/uniref30/UniRef30_2023_02 \
--pdb70_database_path=$DOWNLOAD_DIR/pdb70/pdb70 \
--template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat \
--model_preset=monomer \
--max_template_date=2024-1-1 \
--db_preset=full_dbs \
--output_dir=out_alphafold_2.3.1_multi-monomers \
--fasta_paths=T1083.fasta,T1084.fasta
# create a graph of GPU resource usage stats
gpustats
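Each input fasta gets its own subdirectory, named after the fasta file, under --output_dir; after the job finishes you can inspect the results with commands like the following (directory names assume the example inputs above):
ls out_alphafold_2.3.1_multi-monomers/T1083
ls out_alphafold_2.3.1_multi-monomers/T1084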
- The only parameter you need to change in the example job script is the path to your fasta file: --fasta_paths=
- You can also change these: --max_template_date and --output_dir
- Use the lines that contain $DOWNLOAD_DIR exactly as they are; do not change them.
The run_alphafold.py command starts out on the CPU and later runs on only a single GPU, so request one GPU in the #SBATCH --gres parameter and only a fraction of the node's memory and cores so that other users can use the remaining GPU(s).
The example_data/T1050.fasta input takes about 16 hours when run on CPU only on Grace.
GPU        Runtime
A100       4 hr 9 min
RTX 6000   3 hr 58 min
T4         5 hr 12 min
All GPU types used about 62 GB of system memory.
JAX memory allocation is explained here
Note: if you do not use the --nv option with singularity exec, AlphaFold will run on CPU only.
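To confirm that the GPU is visible inside the container, you can run nvidia-smi through the image from a GPU node; this is just a sanity check and not part of the AlphaFold run:
singularity exec --nv /sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.2.sif nvidia-smi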
Visualize Results
AlphaPickle can be used to create graphs for visualizing each of the model .pkl files.
#!/bin/bash
#SBATCH --job-name=alphafold # job name
#SBATCH --time=2-00:00:00 # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1 # tasks (commands) per compute node
#SBATCH --cpus-per-task=24 # CPUs (threads) per command
#SBATCH --mem=180G # total memory per node
#SBATCH --gres=gpu:a100:1 # request 1 A100 GPU
#SBATCH --output=stdout.%x.%j # save stdout to file
#SBATCH --error=stderr.%x.%j # save stderr to file
module purge
module load GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 AlphaPickle/1.4.1
export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0
DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.3.2
# run gpustats in the background (&) to monitor gpu usage in order to create a graph later
gpustats &
singularity exec --nv /sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.2.sif python /app/alphafold/run_alphafold.py \
--use_gpu_relax \
--data_dir=$DOWNLOAD_DIR \
--uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta \
--mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2022_05.fa \
--bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path=$DOWNLOAD_DIR/uniref30/UniRef30_2023_02 \
--pdb70_database_path=$DOWNLOAD_DIR/pdb70/pdb70 \
--template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat \
--model_preset=monomer \
--max_template_date=2022-1-1 \
--db_preset=full_dbs \
--output_dir=out_alphafold \
--fasta_paths=/scratch/data/bio/alphafold/example_data/T1050.fasta
# run gpustats to create a graph of gpu usage for this job
gpustats
# run AlphaPickle to create a graph for each model .pkl file.
# Name the -od directory based on how you named --output_dir and --fasta_paths in the run_alphafold.py command
run_AlphaPickle.py -od out_alphafold/T1050
Database directory
The AlphaFold database directory is structured as follows (check /scratch/data/bio/alphafold/ for the latest version):
/scratch/data/bio/alphafold/2.3.2/
├── bfd
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│ ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
│ └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── example_data
│ ├── 1L2Y.fasta
│ ├── mmcif_3geh.fa
│ ├── T1083.fasta
│ └── T1083_T1084_multimer.fasta
├── mgnify
│ └── mgy_clusters_2022_05.fa
├── params
│ ├── LICENSE
│ ├── params_model_1_multimer_v3.npz
│ ├── params_model_1.npz
│ ├── params_model_1_ptm.npz
│ ├── params_model_2_multimer_v3.npz
│ ├── params_model_2.npz
│ ├── params_model_2_ptm.npz
│ ├── params_model_3_multimer_v3.npz
│ ├── params_model_3.npz
│ ├── params_model_3_ptm.npz
│ ├── params_model_4_multimer_v3.npz
│ ├── params_model_4.npz
│ ├── params_model_4_ptm.npz
│ ├── params_model_5_multimer_v3.npz
│ ├── params_model_5.npz
│ └── params_model_5_ptm.npz
├── pdb70
│ ├── md5sum
│ ├── pdb70_a3m.ffdata
│ ├── pdb70_a3m.ffindex
│ ├── pdb70_clu.tsv
│ ├── pdb70_cs219.ffdata
│ ├── pdb70_cs219.ffindex
│ ├── pdb70_hhm.ffdata
│ ├── pdb70_hhm.ffindex
│ └── pdb_filter.dat
├── pdb_mmcif
│ ├── mmcif_files
│ └── obsolete.dat
├── pdb_seqres
│ └── pdb_seqres.txt
├── small_bfd
│ └── bfd-first_non_consensus_sequences.fasta
├── uniprot
│ └── uniprot.fasta
├── uniref30
│ ├── UniRef30_2023_02_a3m.ffdata
│ ├── UniRef30_2023_02_a3m.ffindex
│ ├── UniRef30_2023_02_cs219.ffdata
│ ├── UniRef30_2023_02_cs219.ffindex
│ ├── UniRef30_2023_02_hhm.ffdata
│ ├── UniRef30_2023_02_hhm.ffindex
│ └── UniRef30_2023_02.md5sums
└── uniref90
└── uniref90.fasta
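To see which database versions are currently available, list the top-level data directory:
ls /scratch/data/bio/alphafold/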
Usage
If using the AlphaFold software module, run the following for usage:
run_alphafold.py --helpfull
If using the ParaFold software module, run the following for usage:
run_alphafold.sh --help
FAQ
Q. I'm seeing the message "RuntimeError: Resource exhausted: Out of memory"
A. Make sure you have the following two lines in your AlphaFold job script. This is not needed with ParaFold, since it is configured automatically in the ParaFold software module.
export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0
Q. I'm seeing the message: raise ValueError('The number of positions must match the number of atoms')
A. See this bug report
Q. How do I view the contents of a .pkl file such as result_model_2.pkl?
A. Use Python to view the contents; additional code is needed to save the contents to a file.
module purge
module load GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 AlphaPickle/1.4.1
python
>>> import pandas as pd
>>> myresults=pd.read_pickle('/path/to/your/result_model_2.pkl')
>>> print(myresults)
Q. How do I create graphs of the .pkl model files?
A. Use AlphaPickle, which can be run on the login node after an AlphaFold job is complete, since it takes less than a minute to run.
module purge
module load GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 AlphaPickle/1.4.1
# run AlphaPickle to create a graph for each model .pkl file.
# Name the -od directory based on how you named --output_dir and --fasta_paths in the run_alphafold.py command
run_AlphaPickle.py -od out_alphafold/T1050
Q. How can I extract ptm, iptm and ranking_confidence values from a pickle file such as result_model_2.pkl?
A. Use pickle2csv.py to extract the ptm, iptm and ranking_confidence values when using --model_preset=monomer_ptm.
If you used --model_preset=monomer, only ranking_confidence will be extracted.
module purge
module load GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 AlphaPickle/1.4.1
pickle2csv.py -i path/to/pickle/file/result_model_2.pkl -o output.csv
Q. Does AlphaFold run on the ACES PVC accelerators?
A. No, PVC accelerators are not supported by AlphaFold.
Citation
- If you use an AlphaFold prediction in your work, please cite the following papers:
Jumper, J et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021).
Varadi, M et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research (2021).
- In addition, if you use the AlphaFold-Multimer mode, please cite:
Evans, R et al. Protein complex prediction with AlphaFold-Multimer, doi.org/10.1101/2021.10.04.463034
- In addition, if you use ParaFold, please cite:
Bozitao Zhong, Xiaoming Su, Minhua Wen, Sichen Zuo, Liang Hong, James Lin. ParaFold: Paralleling AlphaFold for Large-Scale Predictions. 2021. arXiv:2111.06340. doi.org/10.48550/arXiv.2111.06340
- In addition, if you use AlphaPickle, please cite:
Arnold, M. J. (2021) AlphaPickle, doi.org/10.5281/zenodo.5708709