
AlphaFold

GPU Support

  • Currently, AlphaFold only supports NVIDIA GPUs, not Intel PVC GPUs.

  • Currently, AlphaFold only supports running on a single GPU.

  • Use ParaFold to run AlphaFold on the HPRC clusters, especially ACES, to avoid idle GPU usage.


Description

AlphaFold is an AI system that predicts a protein’s 3D structure from its amino acid sequence.

AlphaFold 3 is not available on HPRC clusters since it is not open source.

Scientists can access the majority of its capabilities, for free, through the AlphaFold Server.

AlphaFold homepage

AlphaFold 2.3.2+ is available on Grace as a Singularity container created from tacc/alphafold:

/sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.2.sif

The AlphaFold 2.3.2 database and newer versions are found in the following directory:

/scratch/data/bio/alphafold/
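
To see which database versions are currently available, you can list this directory (the subdirectory names are version numbers and may change as databases are updated):

# list the available AlphaFold database versions
ls /scratch/data/bio/alphafold/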

If using the Singularity image, you can run AlphaFold with the following command, either interactively through the VNC portal app or from within your job script.

Be sure to select a GPU when launching the VNC app.

singularity exec /sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.2.sif python /app/alphafold/run_alphafold.py --helpfull

See the Biocontainers page for additional details on how to use .sif image files.

ParaFold

ParaFold is the preferred way to run AlphaFold on the HPRC clusters, especially ACES, since it does not hold a GPU idle for hours during the multiple sequence alignment step.

ParaFold is a modified version of DeepMind's AlphaFold 2 designed for high-throughput protein structure prediction.

Example Job Script

This is an example job script that runs the two parts of ParaFold. The first part is a job that runs the multiple sequence alignments on a CPU-only node. When that job completes successfully, it submits the second job, which runs the structure prediction part on a GPU node.

#!/bin/bash
#SBATCH --job-name=parafold-cpu     # job name
#SBATCH --time=1-00:00:00           # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1         # tasks (commands) per compute node
#SBATCH --cpus-per-task=48          # CPUs (threads) per command
#SBATCH --mem=488G                  # total memory per node
#SBATCH --output=stdout.%x.%j       # save stdout to file
#SBATCH --error=stderr.%x.%j        # save stderr to file

module purge
module load GCC/11.3.0  OpenMPI/4.1.4 ParaFold/2.0-CUDA-11.8.0

ALPHAFOLD_DATA_DIR=/scratch/data/bio/alphafold/2.3.2

protein_fasta='/scratch/data/bio/alphafold/example_data/1L2Y.fasta'
model_preset=multimer          # monomer, monomer_casp14, monomer_ptm, multimer
max_template_date=2024-1-1
pf_output_dir=parafold_output_dir
# AlphaFold writes results to a subdirectory of the output directory named after the fasta file basename
pickle_out_dir=$pf_output_dir/1L2Y

jobstats &

# First, run CPU-only steps to get multiple sequence alignments:
run_alphafold.sh -d $ALPHAFOLD_DATA_DIR -o $pf_output_dir -i $protein_fasta -p $model_preset -t $max_template_date -f

# Second, run GPU steps as a separate job after the first part completes successfully:
sbatch --job-name=parafold-gpu --time=2-00:00:00 --ntasks-per-node=1 --cpus-per-task=24 --mem=122G --gres=gpu:h100:1 --partition=gpu --output=stdout.%x.%j --error=stderr.%x.%j --dependency=afterok:$SLURM_JOBID <<EOF
#!/bin/bash
module purge
module load GCC/11.3.0  OpenMPI/4.1.4 ParaFold/2.0-CUDA-11.8.0
jobstats -i 1 &
echo "run_alphafold.sh -g -u 0 -d $ALPHAFOLD_DATA_DIR -o $pf_output_dir -p $model_preset -i $protein_fasta -t $max_template_date"
run_alphafold.sh -g -u 0 -d $ALPHAFOLD_DATA_DIR -o $pf_output_dir -p $model_preset -i $protein_fasta -t $max_template_date
# graph pLDDT and PAE .pkl files
run_AlphaPickle.py -od $pickle_out_dir
jobstats
EOF

jobstats

Example Grace Job Scripts

ParaFold is available on Grace and is the preferred approach to running AlphaFold because it avoids idle GPU usage.

The following is an example of using the Singularity image to run AlphaFold.

Example 1L2Y amino acid sequence:

NLYIQWLKDGGPSSGRPPPS
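
To run the Monomer example below, save the sequence as a FASTA file named 1L2Y.fasta in your job submission directory; the header line shown here is just an illustrative name:

>1L2Y
NLYIQWLKDGGPSSGRPPPS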

Monomer

#!/bin/bash
#SBATCH --job-name=alphafold        # Grace job name  
#SBATCH --time=2-00:00:00           # max job run time dd-hh:mm:ss  
#SBATCH --ntasks-per-node=1         # tasks (commands) per compute node  
#SBATCH --cpus-per-task=24          # CPUs (threads) per command  
#SBATCH --mem=180G                  # total memory per node  
#SBATCH --gres=gpu:a100:1           # request 1 A100 GPU  
#SBATCH --output=stdout.%x.%j       # save stdout to file  
#SBATCH --error=stderr.%x.%j        # save stderr to file  

module purge  

# the next two lines can be added if your job runs out of GPU memory
export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1  
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0  

DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.3.2  

# run gpustats in the background (&) to monitor gpu usage in order to create a graph later  
gpustats &  

singularity exec --nv /sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.2.sif python /app/alphafold/run_alphafold.py  \
  --data_dir=$DOWNLOAD_DIR  --use_gpu_relax \
  --uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta  \
  --mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2022_05.fa  \
  --bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt  \
  --uniref30_database_path=$DOWNLOAD_DIR/uniref30/UniRef30_2023_02 \
  --pdb70_database_path=$DOWNLOAD_DIR/pdb70/pdb70  \
  --template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files  \
  --obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat \
  --model_preset=monomer \
  --max_template_date=2024-1-1 \
  --db_preset=full_dbs \
  --output_dir=output_dir \
  --fasta_paths=1L2Y.fasta

# run gpustats to create a graph of gpu usage for this job  
gpustats

Multimer

#!/bin/bash
#SBATCH --job-name=alphafold        # Grace job name  
#SBATCH --time=2-00:00:00           # max job run time dd-hh:mm:ss  
#SBATCH --ntasks-per-node=1         # tasks (commands) per compute node  
#SBATCH --cpus-per-task=24          # CPUs (threads) per command  
#SBATCH --mem=180G                  # total memory per node  
#SBATCH --gres=gpu:a100:1           # request 1 A100 GPU  
#SBATCH --output=stdout.%x.%j       # save stdout to file  
#SBATCH --error=stderr.%x.%j        # save stderr to file  

module purge  

export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1  
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0  

DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.3.2

# run gpustats in the background (&) to monitor gpu usage in order to create a graph later  
gpustats &  

singularity exec --nv /sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.2.sif python /app/alphafold/run_alphafold.py  \
 --use_gpu_relax \
 --data_dir=$DOWNLOAD_DIR  \
 --uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta  \
 --mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2022_05.fa  \
 --bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt  \
 --uniref30_database_path=$DOWNLOAD_DIR/uniref30/UniRef30_2023_02 \
 --pdb_seqres_database_path=$DOWNLOAD_DIR/pdb_seqres/pdb_seqres.txt  \
 --uniprot_database_path=$DOWNLOAD_DIR/uniprot/uniprot.fasta \
 --template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files  \
 --obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat \
 --model_preset=multimer \
 --max_template_date=2024-1-1 \
 --db_preset=full_dbs \
 --output_dir=out_alphafold \
 --fasta_paths=/scratch/data/bio/alphafold/example_data/T1083_T1084_multimer.fasta

# run gpustats to create a graph of gpu usage for this job  
gpustats

Multiple Monomers

If you have multiple protein sequences from the same organism, running multiple monomers in one job script will significantly speed up processing.

The following job script uses the AlphaFold/2.3.1 module instead of the singularity image.

The AlphaFold/2.3.1 module is usually one version behind the singularity image.

#!/bin/bash  
#SBATCH --job-name=alphafold_multi_monomer  # Grace job name  
#SBATCH --time=2-00:00:00                   # max job run time dd-hh:mm:ss  
#SBATCH --ntasks-per-node=1                 # tasks (commands) per compute node  
#SBATCH --cpus-per-task=24                  # CPUs (threads) per command  
#SBATCH --mem=180G                          # total memory per node  
#SBATCH --gres=gpu:a100:1                   # request one a100 GPU  
#SBATCH --output=stdout.%x.%j               # save stdout to file  
#SBATCH --error=stderr.%x.%j                # save stderr to file

module purge
module load GCC/11.3.0  OpenMPI/4.1.4 AlphaFold/2.3.1-CUDA-11.7.0
module load AlphaPickle/1.4.1

DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.3.2

# start a process to monitor GPU usage  
gpustats &

run_alphafold.py  \
--data_dir=$DOWNLOAD_DIR  \
--uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta  \
--mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2022_05.fa  \
--bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt  \
--uniref30_database_path=$DOWNLOAD_DIR/uniref30/UniRef30_2023_02 \
--pdb70_database_path=$DOWNLOAD_DIR/pdb70/pdb70  \
--template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files  \
--obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat \
--model_preset=monomer \
--max_template_date=2024-1-1 \
--db_preset=full_dbs \
--output_dir=out_alphafold_2.3.1_multi-monomers \
--fasta_paths=T1083.fasta,T1084.fasta

# create a graph of GPU resource usage stats  
gpustats
  • The only parameter you need to change in the example job script is the path to your fasta file(s): --fasta_paths=
  • You can also change --max_template_date and --output_dir (see the sketch after this list).
  • Use the lines that contain $DOWNLOAD_DIR exactly as they are; do not change them.
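
For example, if your own sequences were saved as my_protein1.fasta and my_protein2.fasta (hypothetical file names), only these lines of the job script above would change:

--max_template_date=2024-1-1 \
--output_dir=out_my_proteins \
--fasta_paths=my_protein1.fasta,my_protein2.fasta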

The run_alphafold.py command starts out on CPU and later runs on only a single GPU, so request one GPU with the #SBATCH --gres parameter, and request only a fraction of the node's memory and cores so that other users can use the remaining GPU(s).

The example_data/T1050.fasta example takes 16 hours when run on CPU only on Grace.

GPU         Runtime
A100       4 hr  9 min
RTX 6000   3 hr 58 min
T4         5 hr 12 min

All GPU runs used about 62 GB of system memory.

JAX memory allocation is explained here

Note: if you do not use the --nv option with singularity exec, AlphaFold will run on CPU only.

Visualize Results

AlphaPickle can be used to create graphs for visualizing each of the model .pkl files.

#!/bin/bash  
#SBATCH --job-name=alphafold        # job name  
#SBATCH --time=2-00:00:00           # max job run time dd-hh:mm:ss  
#SBATCH --ntasks-per-node=1         # tasks (commands) per compute node  
#SBATCH --cpus-per-task=24          # CPUs (threads) per command  
#SBATCH --mem=180G                  # total memory per node  
#SBATCH --gres=gpu:a100:1           # request 1 A100 GPU  
#SBATCH --output=stdout.%x.%j       # save stdout to file  
#SBATCH --error=stderr.%x.%j        # save stderr to file

module purge  
module load GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 AlphaPickle/1.4.1

export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1  
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.3.2

# run gpustats in the background (&) to monitor gpu usage in order to create a graph later  
gpustats &

singularity exec --nv /sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.2.sif python /app/alphafold/run_alphafold.py  \
 --use_gpu_relax \
 --data_dir=$DOWNLOAD_DIR  \
 --uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta  \
 --mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2022_05.fa  \
 --bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt  \
 --uniref30_database_path=$DOWNLOAD_DIR/uniref30/UniRef30_2023_02 \
 --pdb70_database_path=$DOWNLOAD_DIR/pdb70/pdb70  \
 --template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files  \
 --obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat \
 --model_preset=monomer \
 --max_template_date=2022-1-1 \
 --db_preset=full_dbs \
 --output_dir=out_alphafold \
 --fasta_paths=/scratch/data/bio/alphafold/example_data/T1050.fasta

# run gpustats to create a graph of gpu usage for this job  
gpustats

# run AlphaPickle to create a graph for each model .pkl file.  
# Name the -od directory based on how you named --output_dir and --fasta_paths in the run_alphafold.py command  
run_AlphaPickle.py -od out_alphafold/T1050

Database directory

The AlphaFold database directory is structured as follows (check /scratch/data/bio/alphafold/ for the latest version):

/scratch/data/bio/alphafold/2.3.2/
├── bfd
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
│   └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── example_data
│   ├── 1L2Y.fasta
│   ├── mmcif_3geh.fa
│   ├── T1083.fasta
│   └── T1083_T1084_multimer.fasta
├── mgnify
│   └── mgy_clusters_2022_05.fa
├── params
│   ├── LICENSE
│   ├── params_model_1_multimer_v3.npz
│   ├── params_model_1.npz
│   ├── params_model_1_ptm.npz
│   ├── params_model_2_multimer_v3.npz
│   ├── params_model_2.npz
│   ├── params_model_2_ptm.npz
│   ├── params_model_3_multimer_v3.npz
│   ├── params_model_3.npz
│   ├── params_model_3_ptm.npz
│   ├── params_model_4_multimer_v3.npz
│   ├── params_model_4.npz
│   ├── params_model_4_ptm.npz
│   ├── params_model_5_multimer_v3.npz
│   ├── params_model_5.npz
│   └── params_model_5_ptm.npz
├── pdb70
│   ├── md5sum
│   ├── pdb70_a3m.ffdata
│   ├── pdb70_a3m.ffindex
│   ├── pdb70_clu.tsv
│   ├── pdb70_cs219.ffdata
│   ├── pdb70_cs219.ffindex
│   ├── pdb70_hhm.ffdata
│   ├── pdb70_hhm.ffindex
│   └── pdb_filter.dat
├── pdb_mmcif
│   ├── mmcif_files
│   └── obsolete.dat
├── pdb_seqres
│   └── pdb_seqres.txt
├── small_bfd
│   └── bfd-first_non_consensus_sequences.fasta
├── uniprot
│   └── uniprot.fasta
├── uniref30
│   ├── UniRef30_2023_02_a3m.ffdata
│   ├── UniRef30_2023_02_a3m.ffindex
│   ├── UniRef30_2023_02_cs219.ffdata
│   ├── UniRef30_2023_02_cs219.ffindex
│   ├── UniRef30_2023_02_hhm.ffdata
│   ├── UniRef30_2023_02_hhm.ffindex
│   └── UniRef30_2023_02.md5sums
└── uniref90
    └── uniref90.fasta

Usage

If using the AlphaFold software module, run the following for usage:

run_alphafold.py --helpfull

If using the ParaFold software module, run the following for usage:

run_alphafold.sh --help
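
If you are not sure which module versions are currently installed, module spider should list them (the HPRC clusters use Lmod):

module spider AlphaFold
module spider ParaFold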

FAQ

Q. I'm seeing the message "RuntimeError: Resource exhausted: Out of memory"

A. Make sure you have the following two lines in your AlphaFold job script. This is not needed with ParaFold since it is configured automatically in the ParaFold software module.

export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1  
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

Q. I'm seeing the message: raise ValueError('The number of positions must match the number of atoms')

A. See this bug report


Q. How do I view the contents of a .pkl file such as result_model_2.pkl?

A. Use Python to view the contents. Additional code is needed to save the contents to a file; see the sketch after the example below.

module purge
module load GCC/10.2.0  CUDA/11.1.1  OpenMPI/4.0.5  AlphaPickle/1.4.1
python
>>> import pandas as pd
>>> myresults=pd.read_pickle('/path/to/your/result_model_2.pkl')
>>> print(myresults)
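
The following is a minimal sketch of saving the contents to a file. The path to result_model_2.pkl is a placeholder, and exactly which keys are present depends on the model preset used:

import pickle

# load the AlphaFold result pickle (update the path to your own output)
with open('/path/to/your/result_model_2.pkl', 'rb') as handle:
    results = pickle.load(handle)

# write a one-line summary per key; large arrays are summarized by their shape
with open('result_model_2_summary.txt', 'w') as out:
    for key, value in results.items():
        if isinstance(value, dict):
            out.write(f'{key}: dict with keys {sorted(value)}\n')
        elif hasattr(value, 'shape'):
            out.write(f'{key}: array with shape {value.shape}\n')
        else:
            out.write(f'{key}: {value}\n')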

Q. How do I create graphs of the .pkl model files?

A. Use AlphaPickle, which can be run on a login node after an AlphaFold job is complete since it takes less than a minute to run.

module purge
module load GCC/10.2.0  CUDA/11.1.1  OpenMPI/4.0.5  AlphaPickle/1.4.1

# run AlphaPickle to create a graph for each model .pkl file.
# Name the -od directory based on how you named --output_dir and --fasta_paths in the run_alphafold.py command
run_AlphaPickle.py -od out_alphafold/T1050

Q. How can I extract ptm, iptm and ranking_confidence values from a pickle file such as result_model_2.pkl?

A. Use pickle2csv.py to extract the values for ptm, iptm and ranking_confidence when using --model_preset=monomer_ptm.

If you used --model_preset=monomer, then only ranking_confidence will be extracted.

module purge
module load GCC/10.2.0  CUDA/11.1.1  OpenMPI/4.0.5  AlphaPickle/1.4.1
pickle2csv.py -i path/to/pickle/file/result_model_2.pkl -o output.csv

Q. Does AlphaFold run on the ACES PVC accelerators?

A. No, Intel PVC accelerators are not supported by AlphaFold.


Citation

  • If you use an AlphaFold prediction in your work, please cite the following papers:

Jumper, J et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021).

Varadi, M et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research (2021).

  • In addition, if you use the AlphaFold-Multimer mode, please cite:

Evans, R et al. Protein complex prediction with AlphaFold-Multimer, doi.org/10.1101/2021.10.04.463034

  • In addition, if you use ParaFold, please cite:

Bozitao Zhong, Xiaoming Su, Minhua Wen, Sichen Zuo, Liang Hong, James Lin. ParaFold: Paralleling AlphaFold for Large-Scale Predictions. 2021. arXiv:2111.06340. doi.org/10.48550/arXiv.2111.06340

  • In addition, if you use AlphaPickle, please cite:

Arnold, M. J. (2021) AlphaPickle, doi.org/10.5281/zenodo.5708709