Difference between revisions of "SW:BLAST"

From TAMU HPRC

Latest revision as of 10:09, 1 September 2021

BLAST & BLAST+

GCATemplates available: ada (blastx)

 module spider BLAST

or

 module spider BLAST+

example commands

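A simple BLAST+ command that searches a protein query FASTA file against the local shared nr database. Use blastp for a protein query; blastx expects a nucleotide query, which it translates before searching:

 blastp -query my_protein_sequences.fasta -db /scratch/data/bio/blast/nr -outfmt 10 -out my_protein_sequences_nr_blastout.csv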

check database release

To check the downloaded version of the nr or nt database, load the latest version of BLAST+ and use the following command:

blastdbcmd -info -db /scratch/data/bio/blast/nr

v5 database

BLAST+ version 2.8.1 and newer support version 5 of the BLAST database format, which allows limiting a search by taxonomy.

The version 5 BLAST nr and nt databases, and others, are available in /scratch/data/bio/blast. Pass the database path to the -db option of a blast command:

/scratch/data/bio/blast/nr
/scratch/data/bio/blast/nt

limit by taxonomy

Use -taxids to specify a taxonomy id:

blastn -query myseqs.fasta -taxids 33208 -db /scratch/data/bio/blast/nt -outfmt 5 -out blastout.xml -num_threads 8

If you get an error like "BLAST Database error: Taxonomy ID(s) not found.", the taxid is most likely at too high a taxonomic level. In that case, use a list of species-level taxids instead.

To build a taxid list, which you can do on the command line before submitting a job, load the EDirect module and a compatible BLAST+ module:

EDirect/15.0.20210505-GCCcore-10.2.0
BLAST+/2.11.0-gompi-2020b

Then run the get_species_taxids.sh script and save the output to a file (33208.txids in this example):

get_species_taxids.sh -t 33208 > 33208.txids

Then use the -taxidlist option instead of -taxids in your job script:

blastn -query myseqs.fasta -taxidlist 33208.txids -db /scratch/data/bio/blast/nt -outfmt 5 -out blastout.xml -num_threads 8

You can also get the same subtree species taxid list from the NCBI Taxonomy website (33208 used as an example): https://www.ncbi.nlm.nih.gov/taxonomy/?term=txid33208%5Bsubtree%5D
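
An empty or malformed taxid list is easier to catch on the login node than inside a queued job. The following is a minimal sketch of such a check, assuming the list holds one numeric taxid per line. The function name check_taxid_list is hypothetical and is not part of BLAST+ or EDirect.

```shell
#!/usr/bin/env bash
# Sketch only: check_taxid_list is a hypothetical helper, not part of BLAST+ or EDirect.

# Succeeds (exit status 0) only if the file exists, is non-empty,
# and every line is a bare integer taxid.
check_taxid_list() {
    local file="$1"
    # -s: the file exists and has a size greater than zero
    [ -s "$file" ] || return 1
    # grep -v selects lines that are NOT purely digits;
    # finding any such line means the list is malformed
    if grep -qvE '^[0-9]+$' "$file"; then
        return 1
    fi
    return 0
}

# Intended use on a login node, after loading the modules listed above:
#   get_species_taxids.sh -t 33208 > 33208.txids
#   check_taxid_list 33208.txids && echo "taxid list looks usable"
```

If the check fails, inspect the file before submitting the job that uses -taxidlist.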

v4 database

BLAST version 4 databases (pre BLAST+ 2.8.1), such as nr and nt, are not currently available.

benchmarks

On Terra, the recommended number of cores to use with blast is 28 (all cores on a node).
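
BLAST+ programs run on a single thread unless -num_threads is given, so requesting 28 cores only helps if the same number is passed to blast. The following is a minimal sketch that keeps the two in step. It assumes the job runs under Slurm, which sets SLURM_CPUS_PER_TASK when --cpus-per-task is requested. The function name blast_threads and the file names in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: blast_threads is a hypothetical helper name.

# Print the thread count to pass to -num_threads.
# SLURM_CPUS_PER_TASK is set by Slurm when a job requests --cpus-per-task;
# outside such a job (for example on a login node) fall back to 1.
blast_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Intended use inside a job script that requested 28 cores:
#   blastx -query myseqs.fasta -db /scratch/data/bio/blast/nr \
#       -outfmt 10 -out blastout.csv -num_threads "$(blast_threads)"
```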

BLAST+ v2.11.0 blastx benchmarks for 25 nucleotide sequences vs nr on Terra

  cores    memory    runtime hh:mm:ss
    1       4GB      > 24 hours
    8      54GB      03:37:32
   14      54GB      02:23:22
   28      54GB      01:50:44
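
To compare the rows above, it can help to turn the runtimes into speedup factors. The following sketch converts hh:mm:ss to seconds and divides the 8-core runtime by each faster runtime. The helper names to_seconds and speedup are hypothetical, and the 1-core row is left out because it only has a lower bound.

```shell
#!/usr/bin/env bash
# Sketch only: to_seconds and speedup are hypothetical helper names.

# Convert an hh:mm:ss runtime to whole seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print how many times faster the second runtime is than the first,
# rounded to two decimals.
speedup() {
    awk -v base="$(to_seconds "$1")" -v run="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", base / run }'
}

# Runtimes from the table above, relative to the 8-core run.
speedup 03:37:32 02:23:22   # 14 cores, prints 1.52
speedup 03:37:32 01:50:44   # 28 cores, prints 1.96
```

Going from 8 to 28 cores (3.5 times the cores) roughly halves the runtime, so scaling is less than linear, although 28 cores remains the fastest configuration measured.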