Sequence Variants (SNPs, indels)
Call Variants
FreeBayes
FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.
Platypus
Platypus homepage github page
Platypus is a tool designed for efficient and accurate variant-detection in high-throughput sequencing data. By using local realignment of reads and local assembly it achieves both high sensitivity and high specificity. Platypus can detect SNPs, MNPs, short indels, replacements and (using the assembly option) deletions up to several kb.
GATK
GCATemplates available: no
GATK 4.0
Use the following command to run GATK version 4.0+ instead of using the .jar file
gatk
If whole genome processing for a large genome is taking a long time, you can use intervals to parallelize the analysis and analyse sequential regions of a chromosome or more specifically just coding regions.
Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.
Tool | -L | Why or why not? |
---|---|---|
BaseRecalibrator | YES | This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration. |
PrintReads | NO | Output is a bam file; using -L would lead to lost data. |
UnifiedGenotyper/Haplotype Caller | YES | We’re only interested in making calls in exome regions; the rest is a waste of time & includes lots of false positives. |
Next steps | NO | No need since subsequent steps operate on the callset, which was restricted to the exome at the calling step. |
GATK 3.0
Recommended workflows for variant discovery analysis with GATK
module spider GATK
To execute GATK run:
The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data
The GATK tools are primarily designed to process exomes and whole genomes generated with Illumina sequencing technology, but they can be adapted to handle a variety of other technologies and experimental designs.
SAMtools
GCATemplates available: yes (look at bwa templates)
SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
Example of sorting and creating a bam file without creating an intermediate sam file (reserving the entire compute node 20 cores).
Notice that using the $TMPDIR for temporary files is useful since files in $TMPDIR do now count against your quota.
bwa mem -M -t 20 -R "@RG\tID:my_readgroup\tLB:my_library\tSM:my_sample\tPL:ILLUMINA" \
my_ref_genome.fasta paired_end_R1.fastq.gz paired_end_R2.fastq.gz | samtools view -h -Sb - | \
samtools sort -m 2G --threads 10 - -T $TMPDIR/sorted -o my_output_bam
IMSindel
GCATemplates available: no
IMSindel An accurate intermediate-size indel detection tool incorporating de novo assembly and gapped global-local alignment with split read analysis.
module spider IMSindel/1.0.1-foss-2019b-Ruby-2.7.1
The output directory option --outd is required and the directory must exist prior to running imsindel.
The memory usage is low and you can complete the analysis in a timely manner by using half the cores on a compute node since the early step uses multiple cores but the majority of the remaining processing uses one core.
Parsnp
GCATemplates available: no
Parsnp is a command-line-tool for efficient microbial core genome alignment and SNP detection. Parsnp was designed to work in tandem with Gingr, a flexible platform for visualizing genome alignments and phylogenetic trees; both Parsnp and Gingr form part of the Harvest suite.
kSNP
GCATemplates available: no
kSNP identifies the pan-genome SNPs in a set of genome sequences, and estimates phylogenetic trees based upon those SNPs. SNP discovery is based on k-mer analysis, and requires no multiple sequence alignment or the selection of a reference genome, so kSNP can take 100's of microbial genomes as input.
Delly
GCATemplates available:
Delly homepage
Delly is an integrated structural variant (SV) prediction method that can discover, genotype and visualize deletions, tandem duplications, inversions and translocations at single-nucleotide resolution in short-read massively parallel sequencing data. It uses paired-ends, split-reads and read-depth to sensitively and accurately delineate genomic rearrangements throughout the genome. Structural variants can be visualized using Delly-maze and Delly-suave.
Delly primarily parallelizes on the sample level. Hence, OMP_NUM_THREADS should be always smaller or equal to the number of input samples.
If you have 5 samples, for example, you would use the following in your job script along with the appropriate #SBATCH values:
export OMP_NUM_THREADS=5
Annotate Variants
snpEff
GCATemplates available: no
snpEff homepage
Genetic variant annotation and effect prediction toolbox.
To see a list of available snpEff genome databases
Download pre-built database
This must be done on a login node and not in a job script.
1) First load the snpEff version of your choice
module load snpEff/4.3t-foss-2018b-Python-3.6.6-Java-1.8.0
2) Copy the main snpEff.config file to your current directory
cp $EBROOTSNPEFF/snpEff.config ./
3) See if the genome is available (save output to a file then search the file):
java -Xmx4g -jar $EBROOTSNPEFF/snpEff.jar databases > avail_dbs.txt
4) Download database using your snpEff.config file which will download it to the current directory in a directory called data:
java -Xmx4g -jar $EBROOTSNPEFF/snpEff.jar download -c snpEff.config Sporisorium_reilianum
5) Use the -config option when running snpEff commands in your job script:
java -Xmx4g -jar $EBROOTSNPEFF/snpEff.jar -config snpEff.config Sporisorium_reilianum my_file.vcf > annotated.vcf
Create your own database
You can create your own custom database with a sequences.fasta and genome.gff3 file (or .gff).
This must be done on a login node and not in a job script.
The following example creates a database for AaegL5.0
1) First load the snpEff version of your choice
module load snpEff/4.3t-foss-2018b-Python-3.6.6-Java-1.8.0
2) Make a directory for your new config file and database files
mkdir -p $SCRATCH/snpeff/4.3t
3) Copy the main snpEff.config file to your new directory
cp $EBROOTSNPEFF/snpEff.config $SCRATCH/snpeff/4.3t/
4) Edit your snpEff.config file and change the data.dir to the directory in your $SCRATCH (use your NETID in the full path)
data.dir = /scratch/user/
NETID
/snpeff/4.3t
5) Add a unique line to your snpEff.config file with your custom genome name using spaces:
AaegL5.0.genome : AaegL5.0
6) Make a directory with the same name that you added to the snpEff.config file and copy your unzipped reference genome and gff3 files to the new directory $SCRATCH/snpeff/4.3t/
mkdir $SCRATCH/snpeff/4.3t/AaegL5.0
cp my_reference_genome.fa $SCRATCH/snpeff/4.3t/AaegL5.0/
cp my_reference_genome.gff $SCRATCH/snpeff/4.3t/AaegL5.0/
7) Create symlinks in the new directory to your two files with names sequence.fa and genes.gff
cd $SCRATCH/snpeff/4.3t/AaegL5.0/
ln -s my_reference_genome.fa sequences.fa
ln -s my_reference_genome.gff genes.gff
8) Make the database using the -config option to specify the new config file.
java -jar $EBROOTSNPEFF/snpEff.jar build -config $SCRATCH/snpeff/4.3t/snpEff.config -gff3 AaegL5.0
This will create a snpEffectPredictor.bin file in the build.dir
9) You can delete the reference and gff files from the directory after you create the database
cd $SCRATCH/snpeff/4.3t/AaegL5.0/
unlink sequences.fa
unlink genes.gff
rm my_reference_genome.fa
rm my_reference_genome.gff
10) Use the -config option when running snpEff commands in your job script:
java -jar $EBROOTSNPEFF/snpEff.jar -config $SCRATCH/snpeff/4.3t/snpEff.config AaegL5.0 my_file.vcf > annotated.vcf
Variant Effect on Protein
PROVEAN
PROVEAN homepage
PROVEAN (Protein Variation Effect Analyzer) is a software tool which predicts whether an amino acid substitution or indel has an impact on the biological function of a protein.
PROVEAN is useful for filtering sequence variants to identify nonsynonymous or indel variants that are predicted to be functionally important.
Since it uses BLAST and psiBLAST against the nr database, it is slow and may take several minutes per query.
If you are working with human variants, you can download a file of precomputed scores for every amino acid change at every position for all human genes at: http://provean.jcvi.org/downloads.php