
Difference between revisions of "SW:BLAST"

From TAMU HPRC

Revision as of 11:14, 7 May 2021

BLAST & BLAST+

GCATemplates available: ada (blastx)

 module spider BLAST

or

 module spider BLAST+

example commands

A simple BLAST+ command that searches a protein query FASTA file against the local shared BLAST nr database (blastp is used here because the query is protein; blastx expects a nucleotide query):

 blastp -query my_protein_sequences.fasta -db /scratch/datasets/blast/nr -outfmt 10 -out my_protein_sequences_nr_blastout.csv
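The CSV written by -outfmt 10 can be post-processed with standard tools. The following is a minimal sketch, not part of the original page, that keeps only the highest-scoring hit per query. It assumes the default 12 columns of -outfmt 10 (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and that no field contains a comma; the function name best_hits is hypothetical.

```shell
# Print the highest-bitscore hit for each query in a BLAST+ -outfmt 10 (CSV) file.
# Assumption: default column layout, with qseqid in column 1 and bitscore in column 12.
best_hits() {
    awk -F',' '
        # Keep a row if it is the first one seen for this query
        # or if its bitscore beats the best one stored so far.
        !($1 in best) || $12 + 0 > best[$1] + 0 { best[$1] = $12; row[$1] = $0 }
        END { for (q in row) print row[q] }
    ' "$1" | sort    # awk array order is unspecified, so sort for stable output
}
```

For example, best_hits my_protein_sequences_nr_blastout.csv > best_hits.csv would write one line per query sequence.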

check database release

To check the downloaded version of the nr or nt database, load the latest version of BLAST+ and use the following command:

blastdbcmd -info -db /scratch/data/bio/blast/nr

v5 database

BLAST+ 2.8.1 and newer support version 5 of the BLAST database format, which allows limiting a search by taxonomy.

The version 5 nr and nt databases, among others, are available in /scratch/data/bio/blast. Pass one of the following paths to the -db option of the blast command:

/scratch/data/bio/blast/nr
/scratch/data/bio/blast/nt

limit by taxonomy

Use -taxids to specify a taxonomy id:

blastn -query myseqs.fasta -taxids 33208 -db /scratch/data/bio/blast/nt -outfmt 5 -out blastout.xml -num_threads 8

If you get an error like "BLAST Database error: Taxonomy ID(s) not found.", the taxonomy ID is most likely at too high a taxonomic level (above species). In that case, expand it into a list of the species-level taxids beneath it.

To generate the taxid list, which you can do on the command line before submitting a job, load the EDirect module and a compatible BLAST+ module:

EDirect/15.0.20210505-GCCcore-10.2.0
BLAST+/2.11.0-gompi-2020b

Then run the get_species_taxids.sh script and save the output to a file (33208.txids in this example):

get_species_taxids.sh -t 33208 > 33208.txids
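Before submitting a job it can be worth confirming that the list file actually contains taxids, since an empty or malformed file would only surface as a failure once the job runs. The following is a minimal sketch, not part of the original page; the function name check_taxid_list is hypothetical, and it assumes the file holds one numeric taxid per line.

```shell
# Succeed only if the file is non-empty and every line is a bare integer taxid.
# Assumption: one taxid per line, with no header or extra columns.
check_taxid_list() {
    [ -s "$1" ] || { echo "error: $1 is missing or empty" >&2; return 1; }
    # grep -v selects lines that are NOT purely numeric; any such line is an error.
    if grep -qvE '^[0-9]+$' "$1"; then
        echo "error: $1 contains lines that are not numeric taxids" >&2
        return 1
    fi
    # All lines are valid at this point, so counting matches counts taxids.
    echo "$(grep -cE '^[0-9]+$' "$1") taxids in $1"
}
```

Running check_taxid_list 33208.txids prints the number of taxids found, or an error message with a non-zero exit status.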

Then use the -taxidlist option instead of -taxids in your job script:

blastn -query myseqs.fasta -taxidlist 33208.txids -db /scratch/data/bio/blast/nt -outfmt 5 -out blastout.xml -num_threads 8

You can also get the same subtree species taxid list from the NCBI Taxonomy website (33208 is used as an example): https://www.ncbi.nlm.nih.gov/taxonomy/?term=txid33208%5Bsubtree%5D

v4 database

Version 4 BLAST databases (the format used before BLAST+ 2.8.1), such as nr and nt, are not currently available.

benchmarks

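The maximum recommended number of cores to use with BLAST is 8. In the benchmarks below, going from 7 to 8 cores saves less than half a minute, and the 28-core run takes nearly twice as long as the 8-core run.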

(This will be updated soon with a newer version of BLAST+ and the v5 database)

BLAST+ v2.7.1 blastp benchmarks for 1 protein sequence vs nr on Terra

 Cores   Memory   Run time
  1       2 GB    36.1 minutes
  7      12 GB    14.6 minutes
  8      14 GB    14.2 minutes
 28      54 GB    27.5 minutes
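To make the scaling in these numbers easier to read, the sketch below (not part of the original page) computes speedup and parallel efficiency relative to the 1-core run. The function name speedup_table is hypothetical; the input values are the core counts and run times listed above.

```shell
# Print "<cores> <speedup> <efficiency %>" for each run, relative to the first line.
# Input lines: "<cores> <minutes>", with the 1-core baseline first.
speedup_table() {
    awk 'NR == 1 { base = $2 }
         { printf "%d %.2f %.1f\n", $1, base / $2, 100 * base / ($2 * $1) }'
}

printf '%s\n' '1 36.1' '7 14.6' '8 14.2' '28 27.5' | speedup_table
# 1 1.00 100.0
# 7 2.47 35.3
# 8 2.54 31.8
# 28 1.31 4.7
```

With these figures, 8 cores give about a 2.5x speedup at roughly 32% efficiency, while 28 cores drop to about 1.3x, which is consistent with the 8-core recommendation.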