SW:BLAST
Latest revision as of 11:09, 1 September 2021
BLAST & BLAST+
GCATemplates available: ada (blastx)
module spider BLAST
or
module spider BLAST+
example commands
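Simple BLAST+ command that searches the local shared BLAST nr database with a protein query FASTA file. blastp is used here because the query is protein; for a nucleotide query, use blastx with the same options, which translates the query before searching nr:
blastp -query my_protein_sequences.fasta -db /scratch/data/bio/blast/nr -outfmt 10 -out my_protein_sequences_nr_blastout.csv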
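The CSV file written by the command above can be post-processed with standard tools. The following is a minimal sketch, assuming the default -outfmt 10 column order (qseqid in column 1, bitscore in column 12) and no header line; the function name best_hits is hypothetical and not part of BLAST+.

```shell
# best_hits: print the highest-bitscore hit for each query in a BLAST+
# "-outfmt 10" CSV file (assumes default columns and no header line).
# Column 1 = qseqid, column 12 = bitscore.
best_hits() {
    awk -F',' '
        # Keep a row when its query is new or its bitscore beats the stored one.
        !($1 in best) || $12 + 0 > score[$1] { best[$1] = $0; score[$1] = $12 + 0 }
        END { for (q in best) print best[q] }
    ' "$1" | sort -t',' -k1,1
}

# Usage (output file name taken from the command above):
# best_hits my_protein_sequences_nr_blastout.csv > best_hits.csv
```

If a custom column list is passed to -outfmt, the column numbers in the awk program need to be adjusted accordingly.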
check database release
To check the downloaded version of the nr or nt database, load the latest version of BLAST+ and use the following command:
blastdbcmd -info -db /scratch/data/bio/blast/nr
v5 database
BLAST+ 2.8.1 and newer support version 5 of the BLAST database format, which allows limiting a search by taxonomy.
The version 5 nr, nt, and other BLAST databases are available in /scratch/data/bio/blast. Pass the database path to the -db option of the blast command, for example:
/scratch/data/bio/blast/nr
/scratch/data/bio/blast/nt
limit by taxonomy
Use -taxids to specify a taxonomy id:
blastn -query myseqs.fasta -taxids 33208 -db /scratch/data/bio/blast/nt -outfmt 5 -out blastout.xml -num_threads 8
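If you get an error like "BLAST Database error: Taxonomy ID(s) not found.", the taxid is most likely at too high a taxonomic level. In that case, expand it into a list of species-level taxids and pass that list to blastn instead.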
To get the taxid list, which you can do on the command line before submitting a job, load the EDirect module and a compatible BLAST+ module:
module load EDirect/15.0.20210505-GCCcore-10.2.0 BLAST+/2.11.0-gompi-2020b
Then run the get_species_taxids.sh script and save the output to a file (33208.txids in this example):
get_species_taxids.sh -t 33208 > 33208.txids
Then use the -taxidlist option instead of -taxids in your job script:
blastn -query myseqs.fasta -taxidlist 33208.txids -db /scratch/data/bio/blast/nt -outfmt 5 -out blastout.xml -num_threads 8
You can also get the same subtree species taxid list from the NCBI Taxonomy website (33208 used as an example): https://www.ncbi.nlm.nih.gov/taxonomy/?term=txid33208%5Bsubtree%5D
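Because the taxid list is generated before the job is submitted, it can be worth checking the file first so that a long job does not start with an empty or malformed list. The following is a minimal sketch, assuming the list contains one numeric taxid per line; the function name check_taxidlist is hypothetical and not part of BLAST+ or EDirect.

```shell
# check_taxidlist: succeed only if the file exists, is non-empty, and every
# line is a bare numeric taxonomy id (assumed layout: one taxid per line).
check_taxidlist() {
    if [ ! -s "$1" ]; then
        echo "empty or missing taxid list: $1" >&2
        return 1
    fi
    # grep -v selects lines that are NOT purely numeric; any such line fails.
    if grep -qvE '^[0-9]+$' "$1"; then
        echo "non-numeric line found in: $1" >&2
        return 1
    fi
    echo "$(wc -l < "$1" | tr -d ' ') taxids in $1"
}

# Usage (file name taken from the example above):
# check_taxidlist 33208.txids && blastn -query myseqs.fasta -taxidlist 33208.txids -db /scratch/data/bio/blast/nt -outfmt 5 -out blastout.xml -num_threads 8
```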
v4 database
BLAST version 4 databases (for BLAST+ versions older than 2.8.1), such as nr and nt, are not currently available.
benchmarks
The recommended number of cores to use with BLAST on Terra is 28 (all cores).
BLAST+ v2.11.0 blastx benchmarks for 25 nucleotide sequences vs nr on Terra
cores  memory  runtime (hh:mm:ss)
1      4GB     > 24 hours
8      54GB    03:37:32
14     54GB    02:23:22
28     54GB    01:50:44
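To compare the benchmark runs, the runtimes can be converted to seconds and expressed as a speedup. The following is a minimal sketch using the runtimes listed above; the 8-core run is used as the baseline because the 1-core run did not finish within 24 hours. The function names to_seconds and speedup are hypothetical helpers.

```shell
# to_seconds: convert an hh:mm:ss runtime into seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# speedup: baseline runtime divided by the faster runtime, two decimals.
speedup() {
    awk -v base="$(to_seconds "$1")" -v t="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", base / t }'
}

speedup 03:37:32 02:23:22   # 8 -> 14 cores, prints 1.52
speedup 03:37:32 01:50:44   # 8 -> 28 cores, prints 1.96
```

In these benchmarks, going from 8 to 28 cores (3.5 times the cores) cuts the runtime roughly in half, so the scaling is sublinear, but the 28-core run is still the fastest.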