NGS: Sequence QC
Evaluation
FastQC
module spider FastQC
After running FastQC from the command line, you can ssh to an HPRC cluster with X11 forwarding enabled (the -X option) and view the result images with the eog tool.
From your desktop:
ssh -X username@grace.hprc.tamu.edu
From your FastQC working directory on Grace, unzip the .zip results file, then use eog to view the results in the Images directory:
eog sample_fastqc/Images/per_sequence_gc_content.png
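The steps above can be sketched as a short shell session. This is only an illustration: the input file name sample.fastq.gz and the output directory fastqc_out are hypothetical, and the FastQC module is assumed to be loaded already. The -t (threads) and -o (output directory) options are standard FastQC options. FastQC is run only if the fastqc command and the input file are present; the final line prints the eog command to use for the unzipped report.

```shell
# Hypothetical input file and output directory; replace with your own
reads="sample.fastq.gz"
outdir="fastqc_out"
threads="${SLURM_CPUS_PER_TASK:-1}"   # 1 thread when run outside a Slurm job

# FastQC names its report after the input file with the FASTQ extensions removed
base=$(basename "$reads")
base="${base%.gz}"
base="${base%.fastq}"
base="${base%.fq}"
report="${outdir}/${base}_fastqc"

# Run FastQC and unzip the report only where the tool and the input exist
if command -v fastqc >/dev/null 2>&1 && [ -f "$reads" ]; then
    mkdir -p "$outdir"
    fastqc -t "$threads" -o "$outdir" "$reads"
    unzip -o -q "${report}.zip" -d "$outdir"
fi

# Command to view one of the report images over X11
echo "eog ${report}/Images/per_sequence_gc_content.png"
```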
You can also run FastQC interactively using the FastQC GUI by logging in using X11 forwarding and running the command:
fastqc
RNA-SeQC
GCATemplates available: no
RNA-SeQC homepage
module spider RNA-SeQC
RNA-SeQC is a Java program that computes a series of quality control metrics for RNA-seq data.
To run RNA-SeQC after loading the module:
java -jar $EBROOTRNASEQC/RNA-SeQC_v1.1.8.jar
KmerGenie
GCATemplates available: no
KmerGenie homepage
module spider KmerGenie
KmerGenie estimates the best k-mer length for genome de novo assembly.
Qualimap
GCATemplates available: no
Qualimap homepage
- fast analysis across the reference genome of mapping coverage and nucleotide distribution;
- easy-to-interpret summary of the main properties of the alignment data;
- analysis of the reads mapped inside/outside of the regions defined in an annotation reference;
- computation and analysis of read counts obtained by intersecting read alignments with genomic features;
- analysis of the adequacy of the sequencing depth in RNA-seq experiments;
- support for multi-sample comparison for alignment data and counts data;
- clustering of epigenomic profiles.
module spider Qualimap
Enter the following command to see the command line options:
qualimap -h
Qualimap can use more than one core, so you need to tell it how many cores to use with the qualimap -nt option.
The number of cores you request in the Slurm parameters is available in the environment variable $SLURM_CPUS_PER_TASK.
For example, if you have these Slurm parameters:
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=28
Then you can pass the environment variable $SLURM_CPUS_PER_TASK to the -nt option to specify the number of cores for qualimap to use:
qualimap -nt $SLURM_CPUS_PER_TASK
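As a fuller illustration, the fragment below shows how the Slurm parameters and the -nt option might fit together in a job script. It is a sketch under assumptions: the BAM file name and output directory are hypothetical, bamqc is used as an example Qualimap tool, and the Qualimap module is assumed to be loaded earlier in the script. The fallback to 1 thread only matters when the script is run outside a Slurm job.

```shell
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=28

# Use the core count requested from Slurm; fall back to 1 outside a job
threads="${SLURM_CPUS_PER_TASK:-1}"

# Hypothetical input and output names
bam="sample.sorted.bam"
outdir="qualimap_bamqc"

# Show the command that will be used
qualimap_cmd="qualimap bamqc -bam $bam -nt $threads -outdir $outdir"
echo "$qualimap_cmd"

# Run only where qualimap is available and the BAM file exists
if command -v qualimap >/dev/null 2>&1 && [ -f "$bam" ]; then
    qualimap bamqc -bam "$bam" -nt "$threads" -outdir "$outdir"
fi
```

Reading the thread count from $SLURM_CPUS_PER_TASK keeps the qualimap command in step with the #SBATCH --cpus-per-task line, so only one value has to be changed.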
If you run qualimap without options, it will start the GUI version. The GUI version works best with the HPRC Portal.
If you would like to use the GUI version, you can log in to the OnDemand portal at portal.hprc.tamu.edu and select VNC in the 'Interactive Apps' tab. You might request all cores and all memory of a node initially, until you get an idea of how much memory Qualimap requires.
When the VNC session loads after you click the blue launch button, you will reach a terminal where you can start the GUI version of Qualimap with the following commands:
module purge
module spider Qualimap
# load the latest version using the appropriate module load command then run qualimap
qualimap
Screen Reads
FastQ Screen
GCATemplates available: no
FastQ Screen allows you to screen a library of sequences in FastQ format against a set of sequence databases so you can see whether the composition of the library matches what you expect.
module spider FastQ-Screen
After loading the FastQ-Screen module, copy the config file to your working directory:
cp $EBROOTFASTQMINSCREEN/fastq_screen.conf.example fastq_screen.conf && chmod u+w fastq_screen.conf
There are some databases already available on Grace and FASTER.
Add the following line to the fastq_screen.conf file to screen for the PhiX database using Bowtie2.
DATABASE PhiX /scratch/data/bio/genome_indexes/ncbi/PhiX/bowtie2/NC_001422.1
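One way to add that line from the shell, without duplicating it if the step is repeated, is sketched below. The read file name sample.fastq.gz is hypothetical; --conf and --aligner are standard FastQ Screen options. The DATABASE line is written with single spaces exactly as shown above; if your copy of the example config separates fields with tabs, you may prefer to match that.

```shell
conf="fastq_screen.conf"

# Database line from the text above: name, then the Bowtie2 index basename
db_line="DATABASE PhiX /scratch/data/bio/genome_indexes/ncbi/PhiX/bowtie2/NC_001422.1"

# Append the line only if it is not already in the config file
if ! grep -qxF "$db_line" "$conf" 2>/dev/null; then
    printf '%s\n' "$db_line" >> "$conf"
fi

# Show the active (uncommented) database entries
grep '^DATABASE' "$conf"

# Hypothetical read file; run only where fastq_screen and the reads exist
reads="sample.fastq.gz"
if command -v fastq_screen >/dev/null 2>&1 && [ -f "$reads" ]; then
    fastq_screen --conf "$conf" --aligner bowtie2 "$reads"
fi
```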
Trim
Trimmomatic
Trimmomatic homepage
Trimmomatic manual
module spider Trimmomatic
Trimmomatic performs a variety of useful trimming tasks for Illumina paired-end and single-end data.
Sample command for version 0.39:
java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.39.jar [SE|PE] <options> <files> ... ILLUMINACLIP:$EBROOTTRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10
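To make the template above concrete, here is a minimal paired-end sketch. The input file names are hypothetical, and the SLIDINGWINDOW:4:15 and MINLEN:36 steps are illustrative values of the kind used in the Trimmomatic manual, not HPRC recommendations. In PE mode Trimmomatic takes two input files and writes four output files (paired and unpaired reads for each mate). The command runs only if the Trimmomatic module has set $EBROOTTRIMMOMATIC and both inputs exist.

```shell
# Hypothetical paired-end input files
r1="sample_R1.fastq.gz"
r2="sample_R2.fastq.gz"
prefix="${r1%_R1.fastq.gz}"           # common part of the file names
threads="${SLURM_CPUS_PER_TASK:-1}"   # 1 thread when run outside a Slurm job

if [ -n "${EBROOTTRIMMOMATIC:-}" ] && [ -f "$r1" ] && [ -f "$r2" ]; then
    # ILLUMINACLIP values: seed mismatches 2, palindrome threshold 30,
    # simple clip threshold 10 (as in the sample command above)
    java -jar "$EBROOTTRIMMOMATIC/trimmomatic-0.39.jar" PE -threads "$threads" \
        "$r1" "$r2" \
        "${prefix}_R1.paired.fastq.gz" "${prefix}_R1.unpaired.fastq.gz" \
        "${prefix}_R2.paired.fastq.gz" "${prefix}_R2.unpaired.fastq.gz" \
        "ILLUMINACLIP:$EBROOTTRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10" \
        SLIDINGWINDOW:4:15 MINLEN:36
else
    echo "Trimmomatic module not loaded or input files missing; nothing was run"
fi
```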
Cutadapt
GCATemplates available: no
Cutadapt homepage
module spider cutadapt
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
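A minimal single-end sketch follows. The input file name is hypothetical, and the adapter sequence AGATCGGAAGAGC (the common prefix of Illumina TruSeq adapters) is an assumption about the library; use the adapter that matches your own library preparation. The -a option gives a 3' adapter and -o names the output file.

```shell
# Assumed 3' adapter: common prefix of Illumina TruSeq adapters
adapter="AGATCGGAAGAGC"

# Hypothetical input file; the output name is derived from it
reads="sample.fastq.gz"
trimmed="${reads%.fastq.gz}.trimmed.fastq.gz"

# Show the command that will be used
echo "cutadapt -a $adapter -o $trimmed $reads"

# Run only where cutadapt is available and the reads exist
if command -v cutadapt >/dev/null 2>&1 && [ -f "$reads" ]; then
    cutadapt -a "$adapter" -o "$trimmed" "$reads"
fi
```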
Sequencing Error Correction
Quake
GCATemplates available: no
Quake homepage
module spider Quake
Quake is a package to correct substitution sequencing errors in experiments with deep coverage (e.g. >15X), specifically intended for Illumina sequencing reads.
Lighter
GCATemplates available: no
Lighter homepage
module spider Lighter
Lighter is a k-mer-based error correction method for whole genome sequencing data.
Lighter uses sampling (rather than counting) to obtain a set of k-mers that are likely from the genome.
Using this information, Lighter can correct the reads containing sequence errors.
Merge overlapping reads
FLASH
FLASH homepage
module spider FLASH
FLASH (Fast Length Adjustment of SHort reads) is a very fast and accurate software tool to merge paired-end reads from next-generation sequencing experiments.
FLASH is designed to merge pairs of reads when the original DNA fragments are shorter than twice the length of reads.
The resulting longer reads can significantly improve genome assemblies.
They can also improve transcriptome assembly when FLASH is used to merge RNA-seq data.
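The length condition above can be checked with simple arithmetic before deciding whether merging makes sense. The sketch below uses a hypothetical library of 2 x 150 bp reads from 250 bp fragments, and hypothetical file names in the printed command; -M (maximum overlap) and -o (output prefix) are standard FLASH options. It does not handle fragments shorter than a single read.

```shell
# Hypothetical library: 2 x 150 bp reads from 250 bp fragments
read_len=150
fragment_len=250

# Mates overlap when the fragment is shorter than the two reads combined
overlap=$(( 2 * read_len - fragment_len ))

if [ "$overlap" -gt 0 ]; then
    echo "expected overlap: ${overlap} bp; merging with FLASH is feasible"
    # -M sets the maximum overlap FLASH should expect; -o sets the output prefix
    echo "flash -M $overlap -o sample_merged sample_R1.fastq sample_R2.fastq"
else
    echo "fragments are too long for the mates to overlap"
fi
```

With these numbers the expected overlap is 2 x 150 - 250 = 50 bp, so the first branch is taken.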
PEAR
GCATemplates available: no
PEAR homepage
module spider Pear
PEAR is an ultrafast, memory-efficient and highly accurate paired-end read merger. It is fully parallelized and can run with as little as a few kilobytes of memory.
PEAR evaluates all possible paired-end read overlaps without requiring the target fragment size as input. In addition, it implements a statistical test for minimizing false-positive results.
Coperead
GCATemplates available: no
Coperead homepage
module spider Coperead
COPE (Connecting Overlapped Pair-End reads) is a method to align and connect Illumina paired-end reads whose insert size is smaller than the sum of the two read lengths.