AlphaFold

GCATemplates

Description

AlphaFold is an AI system that predicts a protein’s 3D structure from its amino acid sequence.

AlphaFold homepage

AlphaFold 2.3.1+ is available on Grace as a singularity container created from tacc/alphafold

/sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.1.sif

The AlphaFold 2.3.0 database (and newer versions) can be found in the following directory:

/scratch/data/bio/alphafold/
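
You can list that directory to see which database versions are currently available, for example:

ls /scratch/data/bio/alphafold/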

You can run AlphaFold using the following command either from within the VNC portal app or from within your job script.

Be sure to select a GPU when launching the VNC app.

singularity exec /sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.1.sif python /app/alphafold/run_alphafold.py --helpfull

See the Biocontainers page for additional details on how to use .sif image files.
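
For example, one common way to explore a .sif image is to open an interactive shell inside the container (here to look around /app/alphafold):

singularity shell /sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.1.sif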

Example Job Scripts

Example IL2Y amino acid sequence:

NLYIQWLKDGGPSSGRPPPS
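
For reference, a minimal FASTA file named IL2Y.fasta containing this sequence (as used by the monomer job script below) would look something like this, where the text after '>' is just a label:

>IL2Y
NLYIQWLKDGGPSSGRPPPS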

Monomer

#!/bin/bash
#SBATCH --job-name=alphafold        # Grace job name  
#SBATCH --time=2-00:00:00           # max job run time dd-hh:mm:ss  
#SBATCH --ntasks-per-node=1         # tasks (commands) per compute node  
#SBATCH --cpus-per-task=24          # CPUs (threads) per command  
#SBATCH --mem=180G                  # total memory per node  
#SBATCH --gres=gpu:a100:1           # request 1 A100 GPU  
#SBATCH --output=stdout.%x.%j       # save stdout to file  
#SBATCH --error=stderr.%x.%j        # save stderr to file  

module purge  

# the next two lines can be added if your job runs out of GPU memory
export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1  
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0  

DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.3.0  

# run gpustats in the background (&) to monitor gpu usage in order to create a graph later  
gpustats &  

singularity exec --nv /sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.1.sif python /app/alphafold/run_alphafold.py  \
  --data_dir=$DOWNLOAD_DIR  --use_gpu_relax \
  --uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta  \
  --mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2022_05.fa  \
  --bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt  \
  --uniref30_database_path=$DOWNLOAD_DIR/uniclust30/UniRef30_2022_02 \
  --pdb70_database_path=$DOWNLOAD_DIR/pdb70/pdb70  \
  --template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files  \
  --obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat \
  --model_preset=monomer \
  --max_template_date=2023-1-1 \
  --db_preset=full_dbs \
  --output_dir=output_dir \
  --fasta_paths=IL2Y.fasta

# run gpustats to create a graph of gpu usage for this job  
gpustats
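
If you save the script above as, for example, alphafold_monomer.slurm (a hypothetical filename), submit it with sbatch and monitor it with squeue:

sbatch alphafold_monomer.slurm    # submit the job to the batch system
squeue -u $USER                   # check the status of your queued and running jobs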

Multimer

#!/bin/bash
#SBATCH --job-name=alphafold        # Grace job name  
#SBATCH --time=2-00:00:00           # max job run time dd-hh:mm:ss  
#SBATCH --ntasks-per-node=1         # tasks (commands) per compute node  
#SBATCH --cpus-per-task=24          # CPUs (threads) per command  
#SBATCH --mem=180G                  # total memory per node  
#SBATCH --gres=gpu:a100:1           # request 1 A100 GPU  
#SBATCH --output=stdout.%x.%j       # save stdout to file  
#SBATCH --error=stderr.%x.%j        # save stderr to file  

module purge  

export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1  
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0  

DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.3.0

# run gpustats in the background (&) to monitor gpu usage in order to create a graph later  
gpustats &  

singularity exec --nv /sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.1.sif python /app/alphafold/run_alphafold.py  \  
 --use_gpu_relax \  
 --data_dir=$DOWNLOAD_DIR  \  
 --uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta  \  
 --mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2022_05.fa  \  
 --bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt  \  
 --uniref30_database_path=$DOWNLOAD_DIR/uniclust30/UniRef30_2022_02 \  
 --pdb_seqres_database_path=$DOWNLOAD_DIR/pdb_seqres/pdb_seqres.txt  \  
 --uniprot_database_path=$DOWNLOAD_DIR/uniprot/uniprot.fasta \
 --template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files  \  
 --obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat \  
 --model_preset=multimer \  
 --max_template_date=2023-1-1 \  
 --db_preset=full_dbs \  
 --output_dir=out_alphafold \  
 --fasta_paths=/scratch/data/bio/alphafold/example_data/T1083_T1084_multimer.fasta  

# run gpustats to create a graph of gpu usage for this job  
gpustats
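
The multimer preset folds all sequences found in a single FASTA file as one complex (see the --fasta_paths description under 2.3.1 usage below). For illustration only, a homodimer built from the IL2Y sequence above could be written as a two-record FASTA file:

>chain_A
NLYIQWLKDGGPSSGRPPPS
>chain_B
NLYIQWLKDGGPSSGRPPPS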

Multiple Monomers

If you have multiple protein sequences from the same organism, running multiple monomers in one job script will significantly speed up processing.

The following job script uses the AlphaFold/2.1.1 module instead of the singularity image.

The AlphaFold/2.1.1 module is usually one version behind the singularity image.

#!/bin/bash  
#SBATCH --job-name=alphafold_multi-monomer  # Grace job name  
#SBATCH --time=2-00:00:00                   # max job run time dd-hh:mm:ss  
#SBATCH --ntasks-per-node=1                 # tasks (commands) per compute node  
#SBATCH --cpus-per-task=24                  # CPUs (threads) per command  
#SBATCH --mem=180G                          # total memory per node  
#SBATCH --gres=gpu:a100:1                   # request one a100 GPU  
#SBATCH --output=stdout.%x.%j               # save stdout to file  
#SBATCH --error=stderr.%x.%j                # save stderr to file

module purge  
module load GCC/10.2.0  CUDA/11.1.1  OpenMPI/4.0.5  AlphaFold/2.1.1

DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.1.1

# start a process to monitor GPU usage  
gpustats &

run_alphafold.py  \  
--data_dir=$DOWNLOAD_DIR  \  
--uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta  \  
--mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2018_12.fa  \  
--bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt  \  
--uniclust30_database_path=$DOWNLOAD_DIR/uniclust30/uniclust30_2021_03/UniRef30_2021_03 \  
--pdb70_database_path=$DOWNLOAD_DIR/pdb70/pdb70  \  
--template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files  \  
--obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat \  
--model_preset=monomer \  
--max_template_date=2022-1-1 \  
--db_preset=full_dbs \  
--output_dir=out_alphafold_2.1.1_multi-monomer \  
--fasta_paths=T1083.fasta,T1084.fasta

# create a graph of GPU resource usage stats  
gpustats
  • The only parameter you need to change in the example job script is the path to your fasta file: --fasta_paths=
  • You can also change these: --max_template_date and --output_dir
  • Leave the lines that contain $DOWNLOAD_DIR exactly as they are; do not change them.

The run_alphafold.py command initially runs on CPU but later runs on only a single GPU, so specify one GPU in the #SBATCH --gres parameter and request only a fraction of the system memory and cores so that someone else can use the other GPU(s).

The example_data/T1050.fasta sequence takes 16 hours on CPU alone on Grace.

GPU         Runtime
A100       4 hr  9 min
RTX 6000   3 hr 58 min
T4         5 hr 12 min

All GPUs used about 62 GB of system memory.

JAX memory allocation is explained here

Note: if you do not use the --nv option, AlphaFold will run on CPU only
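
To verify that a GPU is actually visible inside the container before starting a long run (assuming you are on a GPU node), you can run nvidia-smi through the same image:

singularity exec --nv /sw/hprc/sw/bio/containers/alphafold/alphafold_2.3.1.sif nvidia-smi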

Visualize Results

AlphaPickle can be used to create graphs for visualizing each of the model .pkl files.

#!/bin/bash  
#SBATCH --job-name=alphafold        # job name  
#SBATCH --time=2-00:00:00           # max job run time dd-hh:mm:ss  
#SBATCH --ntasks-per-node=1         # tasks (commands) per compute node  
#SBATCH --cpus-per-task=24          # CPUs (threads) per command  
#SBATCH --mem=180G                  # total memory per node  
#SBATCH --gres=gpu:a100:1           # request 1 A100 GPU  
#SBATCH --output=stdout.%x.%j       # save stdout to file  
#SBATCH --error=stderr.%x.%j        # save stderr to file

module purge  
module load GCC/10.2.0  CUDA/11.1.1  OpenMPI/4.0.5  AlphaPickle/1.4.1

export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1  
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.1.2

# run gpustats in the background (&) to monitor gpu usage in order to create a graph later  
gpustats &

singularity exec --nv /sw/hprc/sw/bio/containers/alphafold/alphafold_2.1.2.sif python /app/alphafold/run_alphafold.py  \ 
 --use_gpu_relax \  
 --data_dir=$DOWNLOAD_DIR  \  
 --uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta  \  
 --mgnify_database_path=$DOWNLOAD_DIR/mgnify/mgy_clusters_2018_12.fa  \  
 --bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt  \  
 --uniclust30_database_path=$DOWNLOAD_DIR/uniclust30/uniclust30_2021_03/UniRef30_2021_03 \  
 --pdb70_database_path=$DOWNLOAD_DIR/pdb70/pdb70  \  
 --template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files  \  
 --obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat \  
 --model_preset=monomer \  
 --max_template_date=2022-1-1 \  
 --db_preset=full_dbs \  
 --output_dir=out_alphafold \  
 --fasta_paths=/scratch/data/bio/alphafold/example_data/T1050.fasta

# run gpustats to create a graph of gpu usage for this job  
gpustats

# run AlphaPickle to create a graph for each model .pkl file.  
# Name the -od directory based on how you named --output_dir and --fasta_paths in the run_alphafold.py command  
run_AlphaPickle.py -od out_alphafold/T1050

Database directory

The AlphaFold database directory is structured as follows (check /scratch/data/bio/alphafold/ for the latest version):

/scratch/data/bio/alphafold/2.3.0/
├── [ 4.0K]  bfd
│   ├── [ 1.4T]  bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│   ├── [ 1.7G]  bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│   ├── [  16G]  bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│   ├── [ 1.6G]  bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│   ├── [ 304G]  bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
│   └── [ 124M]  bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── [ 4.0K]  example_data
│   ├── [   27]  1L2Y.fasta
│   ├── [   27]  IL2Y.fasta
│   ├── [  770]  mmcif_3geh.fa
│   ├── [  106]  T1083.fasta
│   └── [  187]  T1083_T1084_multimer.fasta
├── [ 4.0K]  mgnify
│   └── [ 120G]  mgy_clusters_2022_05.fa
├── [ 4.0K]  params
│   ├── [  18K]  LICENSE
│   ├── [ 356M]  params_model_1_multimer_v3.npz
│   ├── [ 356M]  params_model_1.npz
│   ├── [ 356M]  params_model_1_ptm.npz
│   ├── [ 356M]  params_model_2_multimer_v3.npz
│   ├── [ 356M]  params_model_2.npz
│   ├── [ 356M]  params_model_2_ptm.npz
│   ├── [ 356M]  params_model_3_multimer_v3.npz
│   ├── [ 354M]  params_model_3.npz
│   ├── [ 355M]  params_model_3_ptm.npz
│   ├── [ 356M]  params_model_4_multimer_v3.npz
│   ├── [ 354M]  params_model_4.npz
│   ├── [ 355M]  params_model_4_ptm.npz
│   ├── [ 356M]  params_model_5_multimer_v3.npz
│   ├── [ 354M]  params_model_5.npz
│   └── [ 355M]  params_model_5_ptm.npz
├── [ 4.0K]  pdb70
│   ├── [  410]  md5sum
│   ├── [  53G]  pdb70_a3m.ffdata
│   ├── [ 2.0M]  pdb70_a3m.ffindex
│   ├── [ 6.6M]  pdb70_clu.tsv
│   ├── [  21M]  pdb70_cs219.ffdata
│   ├── [ 1.5M]  pdb70_cs219.ffindex
│   ├── [ 3.2G]  pdb70_hhm.ffdata
│   ├── [ 1.8M]  pdb70_hhm.ffindex
│   └── [  19M]  pdb_filter.dat
├── [ 4.0K]  pdb_mmcif
│   ├── [  10M]  mmcif_files
│   └── [ 143K]  obsolete.dat
├── [ 4.0K]  pdb_seqres
│   └── [ 230M]  pdb_seqres.txt
├── [ 4.0K]  uniclust30
│   ├── [ 179G]  UniRef30_2022_02_a3m.ffdata
│   ├── [ 830M]  UniRef30_2022_02_a3m.ffindex
│   ├── [ 7.5G]  UniRef30_2022_02_cs219.ffdata
│   ├── [ 749M]  UniRef30_2022_02_cs219.ffindex
│   ├── [  42G]  UniRef30_2022_02_hhm.ffdata
│   ├── [  24M]  UniRef30_2022_02_hhm.ffindex
│   └── [  379]  UniRef30_2022_02.md5sums
├── [ 4.0K]  uniprot
│   └── [ 104G]  uniprot.fasta
├── [ 4.0K]  uniref30
│   ├── [ 160G]  UniRef30_2021_03_a3m.ffdata
│   ├── [ 758M]  UniRef30_2021_03_a3m.ffindex
│   ├── [ 6.7G]  UniRef30_2021_03_cs219.ffdata
│   ├── [ 683M]  UniRef30_2021_03_cs219.ffindex
│   ├── [  38G]  UniRef30_2021_03_hhm.ffdata
│   ├── [  22M]  UniRef30_2021_03_hhm.ffindex
│   └── [  379]  UniRef30_2021_03.md5sums
└── [ 4.0K]  uniref90
    └── [  67G]  uniref90.fasta

2.3.1 usage

Full AlphaFold protein structure prediction script.
flags:

/app/alphafold/run_alphafold.py:
  --[no]benchmark: Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins.
    (default: 'false')
  --bfd_database_path: Path to the BFD database for use by HHblits.
  --data_dir: Path to directory of supporting data.
  --db_preset: <full_dbs|reduced_dbs>: Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config  (full_dbs)
    (default: 'full_dbs')
  --fasta_paths: Paths to FASTA files, each containing a prediction target that will be folded one after another. If a FASTA file contains multiple sequences, then it will be folded as a multimer. Paths should be separated by
    commas. All FASTA paths must have a unique basename as the basename is used to name the output directories for each prediction.
    (a comma separated list)
  --hhblits_binary_path: Path to the HHblits executable.
    (default: '/usr/bin/hhblits')
  --hhsearch_binary_path: Path to the HHsearch executable.
    (default: '/usr/bin/hhsearch')
  --hmmbuild_binary_path: Path to the hmmbuild executable.
    (default: '/usr/bin/hmmbuild')
  --hmmsearch_binary_path: Path to the hmmsearch executable.
    (default: '/usr/bin/hmmsearch')
  --jackhmmer_binary_path: Path to the JackHMMER executable.
    (default: '/usr/bin/jackhmmer')
  --kalign_binary_path: Path to the Kalign executable.
    (default: '/usr/bin/kalign')
  --max_template_date: Maximum template release date to consider. Important if folding historical test sets.
  --mgnify_database_path: Path to the MGnify database for use by JackHMMER.
  --model_preset: <monomer|monomer_casp14|monomer_ptm|multimer>: Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model
    (default: 'monomer')
  --num_multimer_predictions_per_model: How many predictions (each with a different random seed) will be generated per model. E.g. if this is 2 and there are 5 models then there will be 10 predictions per input. Note: this
    FLAG only applies if model_preset=multimer
    (default: '5')
    (an integer)
  --obsolete_pdbs_path: Path to file containing a mapping from obsolete PDB IDs to the PDB IDs of their replacements.
  --output_dir: Path to a directory that will store the results.
  --pdb70_database_path: Path to the PDB70 database for use by HHsearch.
  --pdb_seqres_database_path: Path to the PDB seqres database for use by hmmsearch.
  --random_seed: The random seed for the data pipeline. By default, this is randomly generated. Note that even if this is set, Alphafold may still not be deterministic, because processes like GPU inference are
    nondeterministic.
    (an integer)
  --[no]run_relax: Whether to run the final relaxation step on the predicted models. Turning relax off might result in predictions with distracting stereochemical violations but might help in case you are having issues with
    the relaxation stage.
    (default: 'true')
  --small_bfd_database_path: Path to the small version of BFD used with the "reduced_dbs" preset.
  --template_mmcif_dir: Path to a directory with template mmCIF structures, each named <pdb_id>.cif
  --uniprot_database_path: Path to the Uniprot database for use by JackHMMer.
  --uniref30_database_path: Path to the UniRef30 database for use by HHblits.
  --uniref90_database_path: Path to the Uniref90 database for use by JackHMMER.
  --[no]use_gpu_relax: Whether to relax on GPU. Relax on GPU can be much faster than CPU, so it is recommended to enable if possible. GPUs must be available if this setting is enabled.
  --[no]use_precomputed_msas: Whether to read MSAs that have been written to disk instead of running the MSA tools. The MSA files are looked up in the output directory, so it must stay the same between multiple runs that are
    to reuse the MSAs. WARNING: This will not check if the sequence, database or configuration have changed.
    (default: 'false')

absl.app:
  -?,--[no]help: show this help
    (default: 'false')
  --[no]helpfull: show full help
    (default: 'false')
  --[no]helpshort: show this help
    (default: 'false')
  --[no]helpxml: like --helpfull, but generates XML output
    (default: 'false')
  --[no]only_check_args: Set to true to validate args and exit.
    (default: 'false')
  --[no]pdb: Alias for --pdb_post_mortem.
    (default: 'false')
  --[no]pdb_post_mortem: Set to true to handle uncaught exceptions with PDB post mortem.
    (default: 'false')
  --profile_file: Dump profile information to a file (for python -m pstats). Implies --run_with_profiling.
  --[no]run_with_pdb: Set to true for PDB debug mode
    (default: 'false')
  --[no]run_with_profiling: Set to true for profiling the script. Execution will be slower, and the output format might change over time.
    (default: 'false')
  --[no]use_cprofile_for_profiling: Use cProfile instead of the profile module for profiling. This has no effect unless --run_with_profiling is set.
    (default: 'true')

absl.logging:
  --[no]alsologtostderr: also log to stderr?
    (default: 'false')
  --log_dir: directory to write logfiles into
    (default: '')
  --logger_levels: Specify log level of loggers. The format is a CSV list of 'name:level'. Where 'name' is the logger name used with 'logging.getLogger()', and 'level' is a level name  (INFO, DEBUG, etc). e.g.
    'myapp.foo:INFO,other.logger:DEBUG'
    (default: '')
  --[no]logtostderr: Should only log to stderr?
    (default: 'false')
  --[no]showprefixforinfo: If False, do not prepend prefix to info messages when it's logged to stderr, --verbosity is set to INFO level, and python logging is used.
    (default: 'true')
  --stderrthreshold: log messages at this level, or more severe, to stderr in addition to the logfile.  Possible values are 'debug', 'info', 'warning', 'error', and 'fatal'.  Obsoletes --alsologtostderr. Using
    --alsologtostderr cancels the effect of this flag. Please also note that this flag is subject to --verbosity and requires logfile not be stderr.
    (default: 'fatal')
  -v,--verbosity: Logging verbosity level. Messages logged at this level or lower will be included. Set to 1 for debug logging. If the flag was not set or supplied, the value will be changed from the default of -1 (warning) to
    0 (info) after flags are parsed.
    (default: '-1')
    (an integer)

absl.testing.absltest:
  --test_random_seed: Random seed for testing. Some test frameworks may change the default value of this flag between runs, so it is not appropriate for seeding probabilistic tests.
    (default: '301')
    (an integer)
  --test_randomize_ordering_seed: If positive, use this as a seed to randomize the execution order for test cases. If "random", pick a random seed to use. If 0 or not set, do not randomize test case execution order. This flag
    also overrides the TEST_RANDOMIZE_ORDERING_SEED environment variable.
    (default: '')
  --test_srcdir: Root of directory tree where source files live
    (default: '')
  --test_tmpdir: Directory for temporary testing files
    (default: '/tmp/absl_testing')
  --xml_output_file: File to store XML test results
    (default: '')

tensorflow.python.ops.parallel_for.pfor:
  --[no]op_conversion_fallback_to_while_loop: DEPRECATED: Flag is ignored.
    (default: 'true')

tensorflow.python.tpu.client.client:
  --[no]hbm_oom_exit: Exit the script when the TPU HBM is OOM.
    (default: 'true')
  --[no]runtime_oom_exit: Exit the script when the TPU runtime is OOM.
    (default: 'true')

absl.flags:
  --flagfile: Insert flag definitions from the given file into the command line.
    (default: '')
  --undefok: comma-separated list of flag names that it is okay to specify on the command line even if the program does not define a flag with that name.  IMPORTANT: flags in this list that have arguments MUST use the
    --flag=value format.
    (default: '')

FAQ

Q. I'm seeing the message "RuntimeError: Resource exhausted: Out of memory"

A. Make sure you have the following two lines in your job script:

export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1  
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

Q. I'm seeing the message: raise ValueError('The number of positions must match the number of atoms')

A. See this bug report


Q. How do I view the contents of a .pkl file such as result_model_2.pkl?

A. Use Python to view the contents. Additional code is needed if you want to save the contents to a file.

module purge
module load GCC/10.2.0  CUDA/11.1.1  OpenMPI/4.0.5  AlphaPickle/1.4.1
python
>>> import pandas as pd
>>> myresults=pd.read_pickle('/path/to/your/result_model_2.pkl')
>>> print(myresults)
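
The loaded object is a dictionary of result arrays and scores, so you can inspect and save individual entries yourself. A minimal sketch for saving one field (assuming a key such as 'plddt' is present, which depends on the model preset):

>>> sorted(myresults.keys())                                    # list the available result fields
>>> import numpy as np
>>> np.savetxt('plddt.csv', myresults['plddt'], delimiter=',')  # save per-residue pLDDT scores to a CSV file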

Q. How do I create graphs of the .pkl model files?

A. Use AlphaPickle, which can be run on a login node after an AlphaFold job is complete, since AlphaPickle takes less than a minute to run.

module purge
module load GCC/10.2.0  CUDA/11.1.1  OpenMPI/4.0.5  AlphaPickle/1.4.1

# run AlphaPickle to create a graph for each model .pkl file.
# Name the -od directory based on how you named --output_dir and --fasta_paths in the run_alphafold.py command
run_AlphaPickle.py -od out_alphafold/T1050

Q. How can I extract ptm, iptm and ranking_confidence values from a pickle file such as result_model_2.pkl?

A. Use pickle2csv.py to extract the values for ptm, iptm and ranking_confidence when using --model_preset=monomer_ptm.

If you used --model_preset=monomer, then only ranking_confidence will be extracted.

module purge
module load GCC/10.2.0  CUDA/11.1.1  OpenMPI/4.0.5  AlphaPickle/1.4.1
pickle2csv.py -i path/to/pickle/file/result_model_2.pkl -o output.csv
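
If you prefer to pull these scores out of the pickle directly in Python instead of using pickle2csv.py, a minimal sketch (assuming the file unpickles to a dictionary, with ptm/iptm only present for pTM or multimer models):

import pickle

with open('path/to/pickle/file/result_model_2.pkl', 'rb') as handle:
    results = pickle.load(handle)

# print whichever scores are present in this pickle
for key in ('ptm', 'iptm', 'ranking_confidence'):
    if key in results:
        print(key, float(results[key]))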

Citation

  • If you use an AlphaFold prediction in your work, please cite the following papers:

Jumper, J et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021).

Varadi, M et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research (2021).

  • In addition, if you use the AlphaFold-Multimer mode, please cite:

Evans, R et al. Protein complex prediction with AlphaFold-Multimer, doi.org/10.1101/2021.10.04.463034

  • AlphaPickle

Arnold, M. J. (2021) AlphaPickle, doi.org/10.5281/zenodo.5708709

ParallelFold

ParallelFold can help you accelerate AlphaFold when you want to predict multiple sequences. By dividing the pipeline into a CPU part and a GPU part, users can run the feature step on CPU nodes using multiple processors.

Using ParallelFold, the first three steps (two jackhmmer searches and one HHblits search) are supposed to run as three separate processes in parallel, but this parallelization is not currently implemented in the ParallelFold repo. When this step is parallelized, it will require 20 cores; 8 cores each for the two jackhmmer steps and 5 cores for the HHblits step.

Using ParallelFold, you can run AlphaFold 2-3 times faster than DeepMind's procedure.

The example T1050.fasta file took 6 hrs 6 min to complete the CPU step and 39 minutes to complete the GPU step.

first step on CPU

It is currently recommended to use 8 cpus-per-task and 60G of memory until ParallelFold is parallelized.

#!/bin/bash  
#SBATCH --export=NONE                # do not export current env to the job  
#SBATCH --job-name=parallelfold      # job name  
#SBATCH --time=1-00:00:00            # max job run time dd-hh:mm:ss  
#SBATCH --ntasks-per-node=1          # tasks (commands) per compute node  
#SBATCH --cpus-per-task=24           # CPUs (threads) per command  
#SBATCH --mem=180G                   # total memory per node  
#SBATCH --output=stdout.%x.%j        # save stdout to file  
#SBATCH --error=stderr.%x.%j         # save stderr to file

module load ParallelFold/2.1.1  
source activate /sw/hprc/sw/Anaconda3/2021.11/envs/parallelfold_2.1.1

DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.1.1

# First, you need CPUs to run get features, run only feature step on CPU by using option: -f  
run_alphafold.sh -f -d $DOWNLOAD_DIR -o out_parallelfold_monomer -p monomer -t 2022-01-01 -i T1050.fasta

second step on GPU

Add the #SBATCH --gres line to run this part on a GPU compute node.

#!/bin/bash  
#SBATCH --export=NONE                # do not export current env to the job  
#SBATCH --job-name=parallelfold      # job name  
#SBATCH --time=08:00:00              # max job run time dd-hh:mm:ss  
#SBATCH --ntasks-per-node=1          # tasks (commands) per compute node  
#SBATCH --cpus-per-task=24           # CPUs (threads) per command  
#SBATCH --mem=180G                   # total memory per node  
#SBATCH --gres=gpu:a100:1            # request 1 A100 GPU  
#SBATCH --output=stdout.%x.%j        # save stdout to file  
#SBATCH --error=stderr.%x.%j         # save stderr to file

module load ParallelFold/2.1.1  
source activate /sw/hprc/sw/Anaconda3/2021.11/envs/parallelfold_2.1.1

export TF_FORCE_UNIFIED_MEMORY=1  
export XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

DOWNLOAD_DIR=/scratch/data/bio/alphafold/2.1.1

# second step runs on GPU  
run_alphafold.sh -d $DOWNLOAD_DIR -u $CUDA_VISIBLE_DEVICES -o out_parallelfold_monomer -p monomer -t 2022-01-01 -i T1050.fasta

third step visualize

This step does not work yet. It is still in development by the ParallelFold group.

module purge
module load ParallelFold/2021.10.28
module load GCC/9.3.0 OpenMPI/4.0.3 matplotlib/3.2.1-Python-3.8.2
module load pymol-open-source/2.5.0-Python-3.8.2

run_figure.py [SystemName]