NGS: Sequence QC

Evaluation

FastQC

Grace

module spider FastQC

After running FastQC from the command line, you can view the result images by connecting to an HPRC cluster over ssh with X11 forwarding enabled (the -X option) and opening them with the eog tool.

From your desktop:

ssh -X username@grace.hprc.tamu.edu

From your FastQC working directory on Grace, unzip the .zip results file, then use eog to view the results in the Images directory:

eog sample_fastqc/Images/per_sequence_gc_content.png
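The command-line run that precedes these steps might look like the sketch below. The file name sample.fastq.gz and the directory fastqc_out are placeholders, not names from this page, and the thread count falls back to 1 when the script is not running inside a Slurm job. Because the results are unpacked into fastqc_out here, the eog path would start with fastqc_out/.

```shell
# Placeholders (assumptions): adjust to your own data.
reads="sample.fastq.gz"
outdir="fastqc_out"
# Use the Slurm allocation if there is one, otherwise a single thread.
threads="${SLURM_CPUS_PER_TASK:-1}"

mkdir -p "$outdir"
if command -v fastqc >/dev/null 2>&1 && [ -f "$reads" ]; then
    # -o sets the output directory, -t the number of files processed in parallel
    fastqc -o "$outdir" -t "$threads" "$reads"
    # FastQC writes sample_fastqc.zip; unpack it to get the Images directory
    unzip -o "$outdir/sample_fastqc.zip" -d "$outdir"
else
    echo "fastqc or $reads not found; load the FastQC module and check the path"
fi
```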

You can also run FastQC interactively using the FastQC GUI by logging in using X11 forwarding and running the command:

fastqc

RNA-SeQC

GCATemplates available: no

RNA-SeQC homepage

module spider RNA-SeQC

RNA-SeQC is a Java program that computes a series of quality control metrics for RNA-seq data.

To run RNA-SeQC after loading the module:

java -jar $EBROOTRNAMINSEQC/RNA-SeQC_v1.1.8.jar

KmerGenie

GCATemplates available: no

KmerGenie homepage

module spider KmerGenie

KmerGenie estimates the best k-mer length for genome de novo assembly.
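A minimal sketch of a KmerGenie run, assuming an input file named reads.fastq (a placeholder) and the tool's documented defaults for the k range. The output prefix kmergenie_out is also an assumption.

```shell
reads="reads.fastq"                      # placeholder input (assumption)
threads="${SLURM_CPUS_PER_TASK:-1}"      # Slurm allocation, or 1 outside a job

if command -v kmergenie >/dev/null 2>&1 && [ -f "$reads" ]; then
    # -l / -k: smallest and largest k to test, -s: step between k values,
    # -t: threads, -o: prefix for the histogram and report files
    kmergenie "$reads" -l 15 -k 121 -s 10 -t "$threads" -o kmergenie_out
else
    echo "would run: kmergenie $reads -l 15 -k 121 -s 10 -t $threads -o kmergenie_out"
fi
```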

Qualimap

GCATemplates available: no

Qualimap homepage

fast analysis across the reference genome of mapping coverage and nucleotide distribution;
easy-to-interpret summary of the main properties of the alignment data;
analysis of the reads mapped inside/outside of the regions defined in an annotation reference;
computation and analysis of read counts obtained from intersecting read alignments with genomic features;
analysis of the adequacy of the sequencing depth in RNA-seq experiments;
support for multi-sample comparison for alignment data and counts data;
clustering of epigenomic profiles.

module spider Qualimap

Enter the following command to see the command line options:

qualimap -h

Qualimap is multithreaded, so specify the number of cores to use with the qualimap -nt option.

Inside a Slurm job, the environment variable $SLURM_CPUS_PER_TASK holds the number of cores you requested with --cpus-per-task.

For example if you have these Slurm parameters:

#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=28

Then you can pass $SLURM_CPUS_PER_TASK to the -nt option to set the number of cores for qualimap to use:

qualimap -nt $SLURM_CPUS_PER_TASK
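Put together, a job script body might look like the following sketch. It assumes the bamqc subcommand and uses the placeholder names sample.bam and qualimap_out. A real job script would also need time and memory directives, which are omitted here.

```shell
#!/bin/bash
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=28

# Fall back to one thread when the script is run outside Slurm.
threads="${SLURM_CPUS_PER_TASK:-1}"
bam="sample.bam"            # placeholder input (assumption)
outdir="qualimap_out"       # placeholder output directory (assumption)

if command -v qualimap >/dev/null 2>&1 && [ -f "$bam" ]; then
    # -nt takes the thread count from the Slurm allocation
    qualimap bamqc -bam "$bam" -nt "$threads" -outdir "$outdir"
else
    echo "would run: qualimap bamqc -bam $bam -nt $threads -outdir $outdir"
fi
```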

If you run qualimap without options, it will start the GUI version. The GUI version works best with the HPRC Portal.

If you would like to use the GUI version, you can log in to the OnDemand portal at portal.hprc.tamu.edu and select VNC in the 'Interactive Apps' tab. You might request all cores and all memory at first, until you know how much memory Qualimap needs.

When the VNC session loads after you click the blue launch button, you will reach a terminal where you can start the GUI version of Qualimap with the following commands:

module purge
module spider Qualimap
# load the latest version using the appropriate module load command then run qualimap
qualimap

Screen Reads

FastQ Screen

GCATemplates available: no

FastQ Screen homepage

FastQ Screen allows you to screen a library of sequences in FastQ format against a set of sequence databases so you can see if the composition of the library matches with what you expect.

module spider FastQ-Screen

After loading the FastQ-Screen module, copy the config file to your working directory:

cp $EBROOTFASTQMINSCREEN/fastq_screen.conf.example fastq_screen.conf && chmod u+w fastq_screen.conf

There are some databases already available on Grace and FASTER.

Add the following line to the fastq_screen.conf file to screen for the PhiX database using Bowtie2.

DATABASE  PhiX   /scratch/data/bio/genome_indexes/ncbi/PhiX/bowtie2/NC_001422.1
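The config edit and a screening run could be scripted roughly as follows. The helper add_database is hypothetical (it is not part of FastQ Screen), and sample.fastq.gz is a placeholder. The helper appends the DATABASE line only if no uncommented entry with that name exists yet.

```shell
add_database() {
    # $1 = config file, $2 = database name, $3 = Bowtie2 index basename
    conf="$1"; name="$2"; index="$3"
    # Skip the append when an uncommented entry for this name is already present
    if ! grep -q "^DATABASE[[:space:]]\{1,\}$name[[:space:]]" "$conf" 2>/dev/null; then
        printf 'DATABASE  %s   %s\n' "$name" "$index" >> "$conf"
    fi
}

add_database fastq_screen.conf PhiX /scratch/data/bio/genome_indexes/ncbi/PhiX/bowtie2/NC_001422.1

reads="sample.fastq.gz"                  # placeholder input (assumption)
threads="${SLURM_CPUS_PER_TASK:-1}"
if command -v fastq_screen >/dev/null 2>&1 && [ -f "$reads" ]; then
    fastq_screen --conf fastq_screen.conf --aligner bowtie2 --threads "$threads" "$reads"
else
    echo "would run: fastq_screen --conf fastq_screen.conf --aligner bowtie2 --threads $threads $reads"
fi
```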

Trim

Trimmomatic

GCATemplates available: Grace (pe)

Trimmomatic homepage

Trimmomatic manual

module spider Trimmomatic

Trimmomatic performs a variety of useful trimming tasks for Illumina paired-end and single-end data.

Sample command for version 0.39

java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.39.jar [SE|PE] <options> <files> ... ILLUMINACLIP:$EBROOTTRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10
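As a concrete illustration, a paired-end run could be filled in as below. The sample_R1/sample_R2 file names are placeholders, and the trimming steps after ILLUMINACLIP are commonly used example values from the Trimmomatic manual, not HPRC recommendations. In ILLUMINACLIP, 2:30:10 sets the seed mismatches, the palindrome clip threshold, and the simple clip threshold.

```shell
threads="${SLURM_CPUS_PER_TASK:-1}"
r1="sample_R1.fastq.gz"     # placeholder inputs (assumption)
r2="sample_R2.fastq.gz"
jar="${EBROOTTRIMMOMATIC:-}/trimmomatic-0.39.jar"

if [ -f "$jar" ] && [ -f "$r1" ] && [ -f "$r2" ]; then
    # PE mode takes two inputs and four outputs (paired and unpaired per mate)
    java -jar "$jar" PE -threads "$threads" \
        "$r1" "$r2" \
        sample_R1.paired.fastq.gz sample_R1.unpaired.fastq.gz \
        sample_R2.paired.fastq.gz sample_R2.unpaired.fastq.gz \
        ILLUMINACLIP:"$EBROOTTRIMMOMATIC/adapters/TruSeq3-PE.fa":2:30:10 \
        LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
else
    echo "Trimmomatic jar or input files not found; load the module and check the paths"
fi
```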

Cutadapt

GCATemplates available: no

Cutadapt homepage

module spider cutadapt

Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
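A minimal paired-end sketch, assuming the common Illumina adapter prefix AGATCGGAAGAGC on both mates. Confirm the adapter for your library kit; the file names and the quality and length cutoffs are placeholders as well.

```shell
adapter="AGATCGGAAGAGC"     # assumed adapter prefix; check your library kit
threads="${SLURM_CPUS_PER_TASK:-1}"
r1="sample_R1.fastq.gz"     # placeholder inputs (assumption)
r2="sample_R2.fastq.gz"

if command -v cutadapt >/dev/null 2>&1 && [ -f "$r1" ] && [ -f "$r2" ]; then
    # -a / -A: 3' adapter for read 1 / read 2, -q: quality-trim 3' ends,
    # -m: drop reads shorter than this, -j: CPU cores, -o / -p: outputs
    cutadapt -j "$threads" -a "$adapter" -A "$adapter" -q 20 -m 20 \
        -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz "$r1" "$r2"
else
    echo "cutadapt or input files not found; load the cutadapt module and check the paths"
fi
```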

Sequencing Error Correction

Quake

GCATemplates available: no

Quake homepage

module spider Quake

Quake is a package to correct substitution sequencing errors in experiments with deep coverage (e.g. >15X), specifically intended for Illumina sequencing reads.

Lighter

GCATemplates available: no

Lighter homepage

module spider Lighter

Lighter is a k-mer-based error correction method for whole genome sequencing data.

Lighter uses sampling (rather than counting) to obtain a set of k-mers that are likely from the genome.

Using this information, Lighter can correct the reads containing sequence errors.
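The sampling rate is passed to Lighter as alpha. The sketch below assumes the rule of thumb from the Lighter README, alpha = 7 / C with C the average coverage. The helper lighter_alpha is hypothetical, and the file name, genome size, coverage, and k-mer length are placeholder values.

```shell
lighter_alpha() {
    # $1 = average coverage C; prints alpha = 7 / C to three decimals
    awk -v c="$1" 'BEGIN { printf "%.3f\n", 7 / c }'
}

reads="reads.fastq"          # placeholder input (assumption)
genome_size=5000000          # placeholder genome size in bases (assumption)
coverage=70                  # placeholder average coverage (assumption)
alpha="$(lighter_alpha "$coverage")"
threads="${SLURM_CPUS_PER_TASK:-1}"

if command -v lighter >/dev/null 2>&1 && [ -f "$reads" ]; then
    # -k takes three values: k-mer length, genome size, and alpha
    lighter -r "$reads" -k 17 "$genome_size" "$alpha" -t "$threads" -od lighter_out
else
    echo "would run: lighter -r $reads -k 17 $genome_size $alpha -t $threads -od lighter_out"
fi
```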

Merge overlapping reads

FLASH

GCATemplates available: Grace

FLASH homepage

module spider FLASH

FLASH (Fast Length Adjustment of SHort reads) is a very fast and accurate software tool to merge paired-end reads from next-generation sequencing experiments.

FLASH is designed to merge pairs of reads when the original DNA fragments are shorter than twice the length of reads.

The resulting longer reads can significantly improve genome assemblies.

They can also improve transcriptome assembly when FLASH is used to merge RNA-seq data.
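The fragment-length condition can be checked before merging, as in this sketch. The helper can_merge is hypothetical, and the read length, fragment length, and file names are placeholders.

```shell
can_merge() {
    # $1 = read length, $2 = fragment length
    # Succeeds when the fragment is shorter than twice the read length
    [ "$2" -lt $(( 2 * $1 )) ]
}

read_len=150                # placeholder values (assumption)
fragment_len=250
r1="sample_R1.fastq"        # placeholder inputs (assumption)
r2="sample_R2.fastq"
threads="${SLURM_CPUS_PER_TASK:-1}"

if can_merge "$read_len" "$fragment_len"; then
    if command -v flash >/dev/null 2>&1 && [ -f "$r1" ] && [ -f "$r2" ]; then
        # -o: output file prefix, -d: output directory, -t: worker threads
        flash -t "$threads" -o sample -d flash_out "$r1" "$r2"
    else
        echo "would run: flash -t $threads -o sample -d flash_out $r1 $r2"
    fi
else
    echo "fragments of $fragment_len bp are too long to overlap with $read_len bp reads"
fi
```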

Pear

GCATemplates available: no

Pear homepage

module spider Pear

PEAR is an ultrafast, memory-efficient and highly accurate paired-end read merger. It is fully parallelized and can run with as little as a few kilobytes of memory.

PEAR evaluates all possible paired-end read overlaps without requiring the target fragment size as input. In addition, it implements a statistical test for minimizing false-positive results.
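A minimal sketch of a PEAR run, assuming placeholder file names and an output prefix of sample:

```shell
r1="sample_R1.fastq"        # placeholder inputs (assumption)
r2="sample_R2.fastq"
threads="${SLURM_CPUS_PER_TASK:-1}"

if command -v pear >/dev/null 2>&1 && [ -f "$r1" ] && [ -f "$r2" ]; then
    # -f: forward reads, -r: reverse reads, -o: output base name, -j: threads
    pear -f "$r1" -r "$r2" -o sample -j "$threads"
else
    echo "would run: pear -f $r1 -r $r2 -o sample -j $threads"
fi
```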

Coperead

GCATemplates available: no

Coperead homepage

module spider Coperead

COPE (Connecting Overlapped Pair-End reads) is a method to align and connect Illumina paired-end reads whose insert size is smaller than the sum of the two read lengths.