
AlphaFold

GCATemplates available: no

Description

AlphaFold is an AI system that predicts a protein’s 3D structure from its amino acid sequence.

AlphaFold homepage

AlphaFold is available on Grace as a singularity container based on catgumag/alphafold.

/sw/hprc/sw/bio/containers/alphafold_latest.sif

The AlphaFold databases are found in the following directory:

 /scratch/data/bio/alphafold/

Run AlphaFold with the following commands from within the VNC portal app or from within your job script. Tested on an A100 GPU.

Be sure to select a GPU when launching the VNC app.

 export SINGULARITY_BINDPATH="/scratch,$TMPDIR"

 singularity exec /sw/hprc/sw/bio/containers/alphafold_latest.sif python /app/alphafold/run_alphafold.py --helpfull

See the Biocontainers wiki page for additional details on how to use .sif image files.
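For a quick interactive sanity check before submitting a batch job, you can open a terminal in a GPU VNC session and confirm that the container can see the GPU and the shared database directory. This is a minimal sketch; the nvidia-smi check assumes the NVIDIA driver utilities are available on the GPU node and that the --nv option is used.

 # make /scratch (databases) and $TMPDIR visible inside the container
 export SINGULARITY_BINDPATH="/scratch,$TMPDIR"
 # confirm the GPU is visible inside the container (requires --nv)
 singularity exec --nv /sw/hprc/sw/bio/containers/alphafold_latest.sif nvidia-smi
 # confirm the shared databases are visible inside the container
 singularity exec /sw/hprc/sw/bio/containers/alphafold_latest.sif ls /scratch/data/bio/alphafold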

Example Job Script

#!/bin/bash
#SBATCH --export=NONE               # do not export current env to the job
#SBATCH --job-name=alphafold        # job name
#SBATCH --time=1-00:00:00           # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1         # tasks (commands) per compute node
#SBATCH --cpus-per-task=24          # CPUs (threads) per command
#SBATCH --mem=180G                  # total memory per node
#SBATCH --gres=gpu:a100:1           # request 1 A100 GPU
#SBATCH --output=stdout.%j          # save stdout to file
#SBATCH --error=stderr.%j           # save stderr to file

DOWNLOAD_DIR=/scratch/data/bio/alphafold

singularity exec --nv /sw/hprc/sw/bio/containers/alphafold_latest.sif python /app/alphafold/run_alphafold.py  \
  --fasta_paths=/scratch/data/bio/alphafold/example_data/T1050.fasta  \
  --data_dir=$DOWNLOAD_DIR  \
  --uniref90_database_path=$DOWNLOAD_DIR/uniref90/uniref90.fasta  \
  --mgnify_database_path=$DOWNLOAD_DIR/mgnfy/mgy_clusters_2018_12.fa  \
  --bfd_database_path=$DOWNLOAD_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt  \
  --uniclust30_database_path=$DOWNLOAD_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08  \
  --pdb70_database_path=$DOWNLOAD_DIR/pdb70/pdb70  \
  --template_mmcif_dir=$DOWNLOAD_DIR/pdb_mmcif/mmcif_files  \
  --obsolete_pdbs_path=$DOWNLOAD_DIR/pdb_mmcif/obsolete.dat  \
  --output_dir=out_alphafold  \
  --model_names=model_1,model_2,model_3,model_4,model_5  \
  --max_template_date=2020-1-1

The run_alphafold.py command starts out running on the CPU and later runs on only a single GPU, so request one GPU in the #SBATCH --gres parameter and only half of the system memory and cores so that someone else can use the other GPU.

The example_data/T1050.fasta job takes 16 hours on CPU only or 6 hours with one A100 GPU, and uses 75GB of memory.

Note: if you do not use the --nv option, AlphaFold will run on the CPU only.
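To run the example, save the job script above (for instance as alphafold.slurm; the file name here is only illustrative) and submit it from a Grace login node. squeue and sacct are standard Slurm commands for checking progress and reviewing resource usage afterwards.

 sbatch alphafold.slurm        # submit the job script
 squeue -u $USER               # check that the job is queued or running
 sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,State   # after completion, review runtime and memory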

Database directory

The AlphaFold database directory is structured as follows:

/scratch/data/bio/alphafold/
├── bfd
│   ├── [1.4T]  bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│   ├── [1.7G]  bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│   ├── [ 16G]  bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│   ├── [1.6G]  bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│   ├── [304G]  bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
│   └── [124M]  bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── example_data
│   └── [ 828]  T1050.fasta
├── mgnfy
│   └── [ 64G]  mgy_clusters_2018_12.fa
├── params
│   ├── [ 20K]  LICENSE
│   ├── [356M]  params_model_1.npz
│   ├── [356M]  params_model_1_ptm.npz
│   ├── [356M]  params_model_2.npz
│   ├── [356M]  params_model_2_ptm.npz
│   ├── [354M]  params_model_3.npz
│   ├── [355M]  params_model_3_ptm.npz
│   ├── [354M]  params_model_4.npz
│   ├── [355M]  params_model_4_ptm.npz
│   ├── [354M]  params_model_5.npz
│   └── [355M]  params_model_5_ptm.npz
├── pdb70
│   ├── [ 410]  md5sum
│   ├── [ 53G]  pdb70_a3m.ffdata
│   ├── [2.0M]  pdb70_a3m.ffindex
│   ├── [6.6M]  pdb70_clu.tsv
│   ├── [ 21M]  pdb70_cs219.ffdata
│   ├── [1.5M]  pdb70_cs219.ffindex
│   ├── [ 19G]  pdb70_from_mmcif_200401.tar.gz
│   ├── [3.2G]  pdb70_hhm.ffdata
│   ├── [1.8M]  pdb70_hhm.ffindex
│   └── [ 19M]  pdb_filter.dat
├── pdb_mmcif
│   ├── [9.0M]  mmcif_files/
│   └── [139K]  obsolete.dat
├── small_bfd
│   └── [ 17G]  bfd-first_non_consensus_sequences.fasta
├── uniclust30
│   └── [ 87G]  uniclust30_2018_08/
└── uniref90
    └── [ 58G]  uniref90.fasta
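If you want to verify that the shared database layout still matches the paths used in the example job script above, a quick listing loop like the following (a sketch, assuming read access to /scratch/data/bio/alphafold) will flag any missing subdirectory:

 DOWNLOAD_DIR=/scratch/data/bio/alphafold
 for d in bfd example_data mgnfy params pdb70 pdb_mmcif small_bfd uniclust30 uniref90; do
     [ -e "$DOWNLOAD_DIR/$d" ] && echo "found: $d" || echo "MISSING: $d"
 done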