
Bioinformatics:Proteomics


Proteomics

Back to Bioinformatics Main Menu

InterProScan

GCATemplates available: grace

InterProScan homepage

 module spider InterProScan

Additional notes on how to run InterProScan

InterPro is a resource that provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites. To classify proteins in this way, InterPro uses predictive models, known as signatures, provided by several different databases (referred to as member databases) that make up the InterPro consortium.

You can see the InterProScan options with the following command:

interproscan.sh

Options specified on the interproscan.sh command line override the settings in the pre-configured interproscan.properties configuration file.

The InterProScan Match Lookup Service only works with module versions InterProScan/5.53-87.0-Python-3.8.2 and later.
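
Below is a minimal example Slurm job script sketch for running InterProScan on a protein FASTA file. The input file name (my_proteins.fasta), the resource values, and the chosen options (-f tsv, --goterms, --cpu) are illustrative only; adjust them for your data and confirm the exact module line with module spider InterProScan.

#!/bin/bash
#SBATCH --export=NONE               # do not export current env to the job
#SBATCH --job-name=interproscan     # job name
#SBATCH --time=1-00:00:00           # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1         # tasks (commands) per compute node
#SBATCH --cpus-per-task=8           # CPUs (threads) per command
#SBATCH --mem=48G                   # total memory per node
#SBATCH --output=stdout.%x.%j       # save stdout to file
#SBATCH --error=stderr.%x.%j        # save stderr to file

module load InterProScan/5.53-87.0-Python-3.8.2

# my_proteins.fasta is a placeholder for your file of protein sequences
interproscan.sh -i my_proteins.fasta -f tsv --goterms --cpu $SLURM_CPUS_PER_TASK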

I-TASSER

GCATemplates available: no

I-TASSER is only for non-commercial use.

I-TASSER website

I-TASSER (Iterative Threading ASSEmbly Refinement) is a hierarchical approach to protein structure prediction and structure-based function annotation.

Available on Grace only.

module load GCC/9.3.0  I-TASSER/5.1-Python-3.8.2

The I-TASSER data libraries are in the following directory:

/scratch/data/bio/i-tasser/5.1

Although the I-TASSER libraries other than nr are updated weekly on the I-TASSER website, the libraries on Grace are updated only at each cluster maintenance.

The nr database will also be updated at each Grace cluster maintenance.

If you would like to use the weekly-updated databases, you can download them to a directory in your $SCRATCH directory and create symlinks to the databases that are not updated weekly, such as nr. All of the I-TASSER databases (not including the nr database) total about 85GB.
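
A minimal sketch of that setup, assuming a hypothetical directory named $SCRATCH/i-tasser-libs for your copies of the weekly-updated libraries:

# keep your own copies of the weekly-updated libraries under $SCRATCH
mkdir -p $SCRATCH/i-tasser-libs
# ... download the weekly-updated libraries from the I-TASSER website into this directory ...

# reuse the nr database already provided on Grace instead of downloading it
ln -s /scratch/data/bio/i-tasser/5.1/nr $SCRATCH/i-tasser-libs/nr

# then point runI-TASSER.pl at your copy with: -libdir $SCRATCH/i-tasser-libs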

runstyle

parallel

  • example command:
    • runI-TASSER.pl -java_home $EBROOTJAVA -runstyle parallel -datadir my_datadir -libdir /scratch/data/bio/i-tasser/5.1 -seqname my_seq_name
  • All jobs will run in parallel across multiple nodes, although the runtime may not drop significantly since some I-TASSER scripts launch fewer processes than there are cores on a single node. CPU cores spend less time idle than with gnuparallel because each job is submitted as a single-core job.
  • When using the parallel runstyle in your runI-TASSER.pl job script, submit your job using 3 tasks and 21GB memory.
    • Other scripts such as runCOFACTOR.pl may need more initial tasks, but generally each process uses a single core.
  • Each automatically generated parallel job is hard coded to use 1 core and 7GB of memory with a 3-day walltime.
  • If your job fails due to insufficient resources, send a message to the HPRC helpdesk and we will expand the resources for the automatically generated jobs.
  • Your job could also fail if you do not have enough SUs to schedule at least 15 single-core jobs for 3 days each (1080 SUs).
example job script
#!/bin/bash
#SBATCH --export=NONE               # do not export current env to the job
#SBATCH --job-name=itasser          # job name
#SBATCH --time=1-00:00:00           # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=3         # tasks (commands) per compute node
#SBATCH --cpus-per-task=1           # CPUs (threads) per command
#SBATCH --mem=21G                   # total memory per node
#SBATCH --output=stdout.%x.%j          # save stdout to file
#SBATCH --error=stderr.%x.%j           # save stderr to file

module load GCC/9.3.0  I-TASSER/5.1-Python-3.8.2

# your sequence.fasta file containing a single protein sequence is in a directory named my_datadir
runI-TASSER.pl -java_home $EBROOTJAVA -runstyle parallel -datadir my_datadir -libdir /scratch/data/bio/i-tasser/5.1 -seqname my_seq_name

gnuparallel

  • example command:
    • runI-TASSER.pl -java_home $EBROOTJAVA -runstyle gnuparallel -datadir my_datadir -libdir /scratch/data/bio/i-tasser/5.1 -seqname my_seq_name
  • All jobs will run in parallel on a single node.
  • When using the gnuparallel runstyle in your job script, submit your job using 3 tasks and 21GB memory (see the example job script after this list).
    • Each automatically generated gnuparallel process is hard coded to use 1 core and 7GB of memory with a 3-day walltime.
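
The following example job script is a sketch that mirrors the parallel example above, changing only the -runstyle option; the my_datadir directory and my_seq_name sequence name are placeholders for your own data.

example job script
#!/bin/bash
#SBATCH --export=NONE               # do not export current env to the job
#SBATCH --job-name=itasser          # job name
#SBATCH --time=1-00:00:00           # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=3         # tasks (commands) per compute node
#SBATCH --cpus-per-task=1           # CPUs (threads) per command
#SBATCH --mem=21G                   # total memory per node
#SBATCH --output=stdout.%x.%j       # save stdout to file
#SBATCH --error=stderr.%x.%j        # save stderr to file

module load GCC/9.3.0  I-TASSER/5.1-Python-3.8.2

# your sequence.fasta file containing a single protein sequence is in a directory named my_datadir
runI-TASSER.pl -java_home $EBROOTJAVA -runstyle gnuparallel -datadir my_datadir -libdir /scratch/data/bio/i-tasser/5.1 -seqname my_seq_name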

serial

  • This is the default if you do not specify -runstyle
  • Avoid this runstyle since it can take about 6x longer to run in serial (see the benchmarks below).

benchmarks

runI-TASSER.pl

  serial:                         1 day 5 hr 22 min
  gnuparallel (single node):      5 hr 3 min
  parallel (Slurm, multi-node):   4 hr 57 min

STRIDE

GCATemplates available: no

STRIDE homepage

 module spider Stride

STRIDE is a program to recognize secondary structural elements in proteins from their atomic coordinates.
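
A minimal usage sketch, assuming a placeholder PDB file named protein.pdb and that the Stride module puts a stride executable on your PATH:

# confirm the exact module version with: module spider Stride
module load Stride

# assign secondary structure elements from the atomic coordinates in a PDB file
stride protein.pdb > protein.stride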