Bioinformatics:Sequence QC
NGS: Sequence QC
Back to Bioinformatics Main Menu
Contents
Evaluation
FastQC
GCATemplates available: grace terra
module spider FastQC
After running FastQC via the command line, you can ssh to an HPRC cluster enabling X11 forwarding by using the -X option and view the images using the eog tool.
From your desktop:
ssh -X username@grace.hprc.tamu.edu
From your FastQC working directory on Grace unzip the .zip results file then use eog to view the results in the Images directory:
eog sample_fastqc/Images/per_sequence_gc_content.png
You can also run FastQC interactively using the FastQC GUI by logging in using X11 forwarding and running the command:
fastqc
RNA-SeQC
GCATemplates available: ada (w/bwa) ada
RNA-SeQC homepage
module spider RNA-SeQC
RNA-SeQC is a java program which computes a series of quality control metrics for RNA-seq data.
To run RNA-SeQC after loading the module:
java -jar $EBROOTRNASEQC/RNA-SeQC_v1.1.8.jar
KmerGenie
GCATemplates available: ada
KmerGenie homepage
module spider KmerGenie
KmerGenie estimates the best k-mer length for genome de novo assembly.
Qualimap
GCATemplates available: no
Qualimap homepage
- fast analysis across the reference genome of mapping coverage and nucleotide distribution;
- easy-to-interpret summary of the main properties of the alignment data;
- analysis of the reads mapped inside/outside of the regions defined in an annotation reference;
- computation and analysis of read counts obtained from intersting of read alignments with genomic features;
- analysis of the adequacy of the sequencing depth in RNA-seq experiments;
- support for multi-sample comparison for alignment data and counts data;
- clustering of epigenomic profiles.
module spider Qualimap
Enter the following command to see the command line options
qualimap -h
Qualimap will use more than one core so you will need to specify the number of cores using the qualimap -nt option.
You can capture the number of cores you specify in the Slurm parameters with the environment variable $SLURM_CPUS_PER_TASK
For example if you have these Slurm parameters:
#SBATCH --ntasks-per-node=1 #SBATCH --cpus-per-task=28
Then you can use the -n value with the environment variable $SLURM_CPUS_PER_TASK to specify the number of cores for qualimap to use
qualimap -nt $SLURM_CPUS_PER_TASK
If you run qualimap without options, it will start the GUI version. The GUI version works best with the HPRC Portal.
If you would like to use the GUI version, you can login to the OnDemand portal at portal.hprc.tamu.edu and select VNC in the 'Interactive Apps' tab. You might start with all cores and all memory initially until you get an idea of how much memory is required for Qualimap.
When the VNC loads after clicking the blue launch button, you will reach a terminal where you can start the GUI version of Qualimap with the following commands
module purge module spider Qualimap # load the latest version using the appropriate module load command then run qualimap qualimap
Screen Reads
FastQScreen
GCATemplates available: no
FastQ Screen allows you to screen a library of sequences in FastQ format against a set of sequence databases so you can see if the composition of the library matches with what you expect.
module load FastQScreen/0.11.4-intel-2015B-Perl-5.20.0
Version 0.11.4 requires the full path to the alignment binary. Use any of the following in your config file
BOWTIE /software/easybuild/software/Bowtie/1.1.2-intel-2015B/bin/bowtie BOWTIE2 /software/easybuild/software/Bowtie2/2.2.9-intel-2015B/bin/bowtie2 BWA /software/easybuild/software/BWA/0.7.15-intel-2015B/bin/bwa BISMARK /software/easybuild/software/Bismark/0.17.0-intel-2015B/bismark
There are some databases already available on Ada. Your config file will look like the following for screening the PhiX and UniVec databases using Bowtie2.
BOWTIE2 /software/easybuild/software/Bowtie2/2.2.9-intel-2015B/bin/bowtie2 DATABASE PhiX /scratch/datasets/genome_indexes/ncbi/PhiX/bowtie2/NC_001422.1 DATABASE UniVec /scratch/datasets/genome_indexes/univec/UniVec_Core/bowtie2/UniVec_Core.fa
Kraken2
GCATemplates available: grace
Kraken2 is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies.
Kraken2 genomic sequence databases for their respective clusters are found in the following Grace directory:
/scratch/data/bio/kraken2
Use a protein database (directories ending in _protein) if you have RNA-seq data
The standard database includes bacterial, archaeal, and viral domains, along with the human genome and a collection of known vectors (UniVec_Core).
Send a request to the HPRC helpdesk if you need other Kraken2 databases.
Kraken
GCATemplates available: ada
Kraken homepage
Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies.
You will need to use at least the 256GB which is recommended for the bacteria database.
module load Kraken/1.1-GCCcore-6.3.0-Perl-5.24.0
After you load the modules, you will need to set the KRAKEN_DB_PATH environment variable
export KRAKEN_DB_PATH=/scratch/datasets/kraken
Currently only the bacteria database is available, If you need other Kraken databases advise the HPRC helpdesk which db you need.
Sample one-line command for single end reads file on a 256GB node for a fastq zipped file
kraken --preload --db bacteria --threads 20 --gzip-compressed --fastq-output --classified-out my_reads_out --fastq-input my_reads.fastq.gz --output kraken.out
You only have to run the first command with --preload, all other kraken commands do not need the --preload option
Sample command to create labels for the classified hits
kraken-report --db bacteria kraken.out > kraken_report.out
Trim
Trimmomatic
GCATemplates available: grace (pe)
Trimmomatic homepage
Trimmomatic manual
module spider Trimmomatic
Trimmomatic performs a variety of useful trimming tasks for illumina paired-end and single ended data
Sample command for version 0.39
java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.39.jar [SE|PE] <options> <files> ... ILLUMINACLIP:$EBROOTTRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10
Cutadapt
GCATemplates available: no
Cutadapt homepage
module spider cutadapt
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
Galaxy
To keep all reads that do not align to any database and also reads that align to all databases (conserved reads) you need to do three steps for paired end files:
1) use FastQScreen with the following options: Filter results into new fastq file? => Align and graph results without filtering fastq file Select how many reads to screen => Use all reads Create tagged file? => Yes Then select the aligner and check the checkboxes for the databases you want. This will produce a TAGGED file which you will use as input for the "Select FastQ Screen reads"
2) In the "Select FastQ Screen reads" tool, select the following options: that => match tags Tag type => preconfigured Tags => All db's are 0's or all db's are non-0's
3) Then if these are paired reads, you can use the Re-pair tool.
Sequencing Error Correction
Quake
GCATemplates available: no
Quake homepage
module spider Quake
Quake is a package to correct substitution sequencing errors in experiments with deep coverage (e.g. >15X), specifically intended for Illumina sequencing reads.
Lighter
GCATemplates available: no
Lighter homepage
module spider Lighter
Lighter is a kmer-based error correction method for whole genome sequencing data.
Lighter uses sampling (rather than counting) to obtain a set of kmers that are likely from the genome.
Using this information, Lighter can correct the reads containing sequence errors.
Merge overlapping reads
FLASH
GCATemplates available: grace
FLASH homepage
module spider FLASH
FLASH (Fast Length Adjustment of SHort reads) is a very fast and accurate software tool to merge paired-end reads from next-generation sequencing experiments.
FLASH is designed to merge pairs of reads when the original DNA fragments are shorter than twice the length of reads.
The resulting longer reads can significantly improve genome assemblies.
They can also improve transcriptome assembly when FLASH is used to merge RNA-seq data.
Pear
GCATemplates available: no
Pear homepage
module spider Pear
PEAR is an ultrafast, memory-efficient and highly accurate pair-end read merger. It is fully parallelized and can run with as low as just a few kilobytes of memory.
PEAR evaluates all possible paired-end read overlaps and without requiring the target fragment size as input. In addition, it implements a statistical test for minimizing false-positive results.
Coperead
GCATemplates available: no
Coperead homepage
module spider Coperead
COPE (Connecting Overlapped Pair-End reads) is a method to align and connect the illumina sequenced Pair-End reads of which the insert size is smaller than the sum of the two read length.