RNA-seq

TUTORIALS

VIDEO "How to analyze RNA-Seq data? Find differentially expressed genes in your research"

tutorials from Griffithlab on RNA-seq analysis workflow

example R script for DESeq2

Transcriptome Assembly

Trinity

GCATemplates

Grace

module spider Trinity

Trinity homepage

Trinity tutorial videos

Trinity memory requirements summary

Trinity performs de novo or reference guided genome assemblies of transcript sequences from Illumina RNA-Seq data.

R Bioconductor libraries, such as qvalue, required by some Trinity analysis scripts are found in this R_tamu module

module load R_tamu/3.6.0-foss-2018b-recommended-mt

Trinity will create 100,000+ temporary files during an analysis run. A quota of 250,000 files (with about 225,000 available) should be enough for one Trinity run at a time. If you want to run multiple Trinity jobs at once, you can use run the analysis in $TMPDIR and copy the results files to your $SCRATCH working directory.

You need to use a directory after $TMPDIR with 'trinity' in the name which will get automatically created. Example: $TMPDIR/trinity_out

Trinity does have checkpoints. If your job times out and you run the same job script again, it should pick up where it left off unless you are using $TMPDIR.

If you are using Galaxy then checkpoints will not work.

NOTE: Trinity 2.4.0+ utilizes seqtk for read normalization which requires read headers to be in a specific format.

This version of Trinity requires either fastq headers in the new style:

@M01581:927:000000000-ARTAL:1:1101:19874:2078 1:N:0:1

or that the fastq headers end with /1 or /2 such as:

@M01581:927:000000000-ARTAL:1:1101:19874:2078/1

If your data come from SRA, be sure to dump the fastq file like so:

fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files file.sra

Sample Trinity Paired End Assembly job Scripts

Grace: compute node

using all 28 cores for 7 days

#!/bin/bash
#SBATCH --job-name=trinity
#SBATCH --time=7-00:00:00
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --mem=360G
#SBATCH --output=stdout.%x.%j
#SBATCH --error=stderr.%x.%j

module load GCC/9.3.0 OpenMPI/4.0.3 Trinity/2.12.0-Python-3.8.2

Trinity --workdir $TMPDIR/trinity_workdir --max_memory 300G --CPU 48 --inchworm_cpu 6 --no_version_check --seqType fq --left myreads_1.fastq.gz --right myreads_2.fastq.gz

The chrysalis step will generate thousands of small files in the $TMPDIR when specifying --workdir $TMPDIR/trinity_workdir which can help in preventing the job from creating more files than are available in your file quota.

Set the --max_memory value to a few GB below the #SBATCH --mem= to prevent reaching the memory limit with Trinity sub-processes.

The --full_cleanup option will remove all the thousands of intermediate temporary files once Trinity has completed.

If you use the --full_cleanup option (recommended), then the final Trinity.fasta file will be copied to the working directory and the outdir will be deleted.

If you do not see your Trinity.fasta file in the working directory then just submit the job script again and let it finish the final step.

If you do not use the --full_cleanup option, then the final Trinity.fasta file will be in the outdir. If you do not see it there then just submit the job script again and let it finish the final step.

using $TMPDIR

GCATemplates

Grace (pe)

You can use the $TMPDIR to write all files to a local drive on the compute node. All files written to $TMPDIR do not count against your file quota so you don't have to request a file quota increase.

Example commands to use in your job script to run Trinity and copy the result files to your working directory:

Trinity --seqType fq --max_memory 300G --single myseqs.fastq.gz --CPU 48 --no_version_check --inchworm_cpu 6 --output $TMPDIR/trinity_out
cp $TMPDIR/trinity_out/Trinity.fasta.gene_trans_map ./
cp $TMPDIR/trinity_out/Trinity.fasta ./

When Trinity is complete, the two necessary results files are copied from the $TMPDIR to the working directory. Rename the files with a prefix if needed when copying the two result files.

If you use $TMPDIR no checkpoints are available. If the job fails due to running out of walltime or memory, all files are deleted and you have to start from the beginning.

You can't use $TMPDIR if using grid mode with tamulauncher.

Example Trinity Tutorials

Trinity with edgeR

If you need more than the 3TB memory Grace compute nodes, you can use external HPCs

Scripture

GCATemplates available: no

Scripture homepage

module spider Scripture

Scripture is a method for transcriptome reconstruction that relies solely on RNA-Seq reads and an assembled genome to build a transcriptome ab initio.

The statistical methods to estimate read coverage significance are also applicable to other sequencing data.

Scripture also has modules for ChIP-Seq peak calling.

StringTie

GCATemplates available: no

StringTie homepage

module spider StringTie

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts.

To identify differentially expressed genes between experiments, StringTie's output can be processed either by the Cuffdiff or Ballgown programs.

Trans-ABySS

GCATemplates available: no

Trans-ABySS homepage

module spider Trans-ABySS

de novo assembly of RNA-Seq data using ABySS

The current version of the Trans-ABySS package comes with 3 main applications:

1. transabyss - assemble RNAseq data

2. transabyss-merge - merge multiple assemblies from (1)

3. transabyss-analyze - analyze assemblies, either from (1) or (2), for structural variants and splice variants. Requires reference genome and annotations.

SOAPdenovo-Trans

GCATemplates available: no

SOAPdenovo-Trans home

module spider SOAPdenovo-Trans

SOAPdenovo-Trans is a de novo transcriptome assembler basing on the SOAPdenovo framework, adapt to alternative splicing and different expression level among transcripts.The assembler provides a more accurate, complete and faster way to construct the full-length transcript sets.

For animal transcriptomes like mouse, about 30-35GB memory would be required.

Transcriptome Assembly Evaluation

DETONATE

GCATemplates available: no

DETONATE homepage

module spider DETONATE

DETONATE (DE novo TranscriptOme rNa-seq Assembly with or without the Truth Evaluation) consists of two component packages, RSEM-EVAL and REF-EVAL. Both packages are mainly intended to be used to evaluate de novo transcriptome assemblies, although REF-EVAL can be used to compare sets of any kinds of genomic sequences.

BUSCO

GCATemplates

BUSCO homepage

version 5.0.x

Version 5.0.0+ now uses Metaeuk as the default gene predictor instead of Augustus so you don't need to rsync the AUGUSTUS directory unless you want to use Augustus.

Databases for v5.0.0+ are found on Grace in the directory:

/scratch/data/bio/busco5/lineages

Contact the HPRC helpdesk if you need additional databases downloaded to the shared busco5 lineages directory.

version 4.0.x

To use BUSCO you need to copy the augustus config files (about 1000 files) to your $SCRATCH directory (only need to do once). Make sure you load the BUSCO module successfully by running 'module list' after the module load command.

module purge
module load BUSCO/4.0.5-foss-2019b-Python-3.7.4
module list
mkdir $SCRATCH/my_augustus_config
rsync -r /sw/eb/software/AUGUSTUS/3.3.3-foss-2019b/  $SCRATCH/my_augustus_config
chmod -R 755 $SCRATCH/my_augustus_config

You need to add the following in your job script:

export AUGUSTUS_CONFIG_PATH="$SCRATCH/my_augustus_config/config"

The lineages for version 4.0.x use odb10. Contact HPRC helpdesk if there is not a lineage for your organism in the odb10 directory.

Grace: /scratch/data/bio/busco5/lineages/

Transcriptome Assembly Annotation

Trinotate

Available GCATemplates

Trinotate is a comprehensive annotation suite designed for automatic functional annotation of transcriptomes, particularly de novo assembled transcriptomes, from model or non-model organisms.

Trinotate home

# Grace
module load GCC/9.3.0 OpenMPI/4.0.3 Trinotate/3.2.2

Trinotate has a pipeline script that can be used to run all annotation steps at once.

Trinotate has checkpoints so if your job is stopped, you can restart the job and it should start from the last checkpoint.

You will need to copy the Trinotate.sqlite database from the following directory to your working directory. You should not have this line in your job script if you are restarting a stopped job. The following UNIX command will copy the necessary files to your current working directory.

Run this command on the UNIX command line prior to running your job:

module load GCC/9.3.0 OpenMPI/4.0.3 Trinotate/3.2.2
cp $EBROOTTRINOTATE/resources/Trinotate.sqlite ./
chmod 755 Trinotate.sqlite

Then in your job script, use the following two lines to create a gene to transcript map and run the pipeline on your Trinity.fasta assembled transcripts file:

$TRINITY_HOME/util/support_scripts/get_Trinity_gene_to_trans_map.pl Trinity.fasta > Trinity.fasta.gene_to_trans_map

$TRINOTATE_HOME/auto/autoTrinotate.pl --transcripts Trinity.fasta --CPU $SLURM_CPUS_PER_TASK \
--Trinotate_sqlite Trinotate.sqlite --gene_to_trans_map Trinity.fasta.gene_to_trans_map

TrinotateWeb

start an interactive job from the Grace command line

srun --time=1-00:00:00 --mem=56G --ntasks=1 --cpus-per-task=8 --pty bash

go to the directory where your Trinotate.sqlite file is located

cd $SCRATCH/mytrinotate/

build a directory from the singularity image file; this will create 22,006 files and directories singularity build --sandbox trinotate_web.dir /sw/hprc/sw/containers/trinotate_web_latest.sif

launch a shell which will take you inside the image to the Singularity> prompt

singularity shell -w -B $PWD:/data trinotate_web.dir

at the Singularity> prompt, run the command: run_TrinotateWebserver.pl 8080

Singularity> run_TrinotateWebserver.pl 8080

open a second terminal session in Grace and check the NODELIST column of the following command

squeue -u $USER

run the following in the terminal on your Mac or Linux desktop computer or your PowerShell if using Windows

use the compute node name (c536 in this example) of your job and keep the numbers 1234 and 8080 the same as below

ssh -N -L 1234:c536:8080 my_NetID@grace.hprc.tamu.edu

Open the following URL on your desktop computer's web browser http://localhost:1234/cgi-bin/index.cgi

enter the following for Path to Trinotate SQLite database:

/data/Trinotate.sqlite

type exit in the Grace terminal to end the interactive job when you are finished browsing the database

exit

hit Ctrl+c keys on your desktop terminal to end the ssh session when you are finished browsing the database

when you are done processing your project, delete the sandbox directory to free up disk space

rm -rf trinotate_web.dir

TransDecoder

GCATemplates available: no

TransDecoder home

module spider TransDecoder

TransDecoder identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity, or constructed based on RNA-Seq alignments to the genome using Tophat and Cufflinks.

miRNA

GCATemplates available: no

miRNA home

module spider miRNA

The BCGSC miRNA Profiling Pipeline produces expression profiles of known miRNAs from BWA-aligned BAM files and generates summary reports and graphs describing the results.

The miRNA application uses a local copy of the UCSC and miRBase databases. The UCSC database is populated only with the organism being analyzed.

First create a directory, go to that directory and run the startup script

mkdir $SCRATCH/mirna_setup
cd $SCRATCH/mirna_setup
/software/hprc/Bio/mirna/scripts/setup_mysql.sh

You only need to run this once and it is run on the command line and not within a job script. It takes about 10 minutes to populate the UCSC and miRBase database files.

Once you successfully run this setup script, you do not need the script any more.

You can run miRNA from any $SCRATCH directory you just need to rsync the installation files to your working directory

cd $SCRATCH/my_working_directory/
rsync -arv $SCRATCH/mirna_setup/ ./

Then you need to create a project directory inside your job script directory and place your .sam and adapter.report file in that directory. In this example the project directory is called project.

Notice that the names of the adapter.report and .sam file must be in this format:

project/my_sorted.sam
project/my_sorted_adapter.report

Currently there is only support for the Dog genome. The UCSC database and species codes are

ucsc_database='canFam3'
ucsc_species_code='cfa'

The commands in your job script will look like this:

# run the mysqld command to start the database
./mysqld start &

sleep 120            # give mysqld_mirna time to initialize the database

annotate.pl -m mirna_21a -u canFam3 -o cfa -p project

# run the mysqld command to stop the database
./mysqld stop

The job script will launch a MySQL database on the compute node and run the analysis and then stop the MySQL database.

Differential Expression

Bowtie & Bowtie2

GCATemplates

Bowtie homepage

module spider Bowtie

Genome indexes for bowtie can be found in the ucsc directory for each organism:

/scratch/datasets/genome_indexes/ucsc/organism_name/bowtie_index

Bowtie2 homepage

module spider Bowtie2

Genome indexes for bowtie2 can be found in the ucsc directory for each organism:

/scratch/datasets/genome_indexes/ucsc/organism_name/bowtie2_index

Bowtie indexes will not work with the Bowtie2 aligner.

Bowtie2 indexes will not work with the Bowtie aligner.

Please contact the HPRC helpdesk help@sc.tamu.edu if you need additional genome indexes

Cufflinks

GCATemplates available: no

Cufflinks homepage

module spider Cufflinks

Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression (cuffdiff) and regulation in RNA-Seq samples.

Cufflinks includes a program, “cuffdiff”, that you can use to find significant changes in transcript expression, splicing, and promoter use.

cummeRbund

GCATemplates available: no

CummeRbund is found in the R_tamu module which loads R along with many Bioconductor pacakges

module spider R_tamu

CummeRbund takes the various output files from a cuffdiff run and creates a SQLite database of the results describing appropriate relationships betweeen genes, transcripts, transcription start sites, and CDS regions.

library(cummeRbund)

Salmon

GCATemplates available: no

Salmon manual

Salmon is a wicked-fast program to produce a highly-accurate, transcript-level quantification estimates from RNA-seq data.

module spider Salmon

Sailfish

GCATemplates

Grace

Sailfish homepage

module spider Sailfish

Sailfish: Rapid Alignment-free Quantification of Isoform Abundance

edgeR

GCATemplates available: no

edgeR homepage

module spider R_tamu/3.2.0-iomkl-2015B-default-mt

Empirical analysis of digital gene expression data in R

edgeR Tutorial

DESeq

GCATemplates available: no

DESeq homepage

DESeq2 homepage

module load R_tamu/3.2.0-intel-2015B-default-mt

Differential gene expression analysis based on the negative binomial distribution

Here are good tutorials

DESeq2 tutorial DESeq2 tutorial(use module load R_tamu/3.3.1-intel-2015B-default-mt)

Ballgown

GCATemplates available: no

ballgown homepage

module load R_tamu/3.5.2-intel-2018b-recommended-mt

Tools for statistical analysis of assembled transcriptomes, including flexible differential expression analysis, visualization of transcript structures, and matching of assembled transcripts to annotation.

Ballgown_input_files(generated by StringTie for example): count files (*ctab files) for each sample

e_data.ctab: exon-level expression measurements.

i_data.ctab: intron- (i.e., junction-) level expression measurements

t_data.ctab: transcript-level expression measurements

e2t.ctab: table with two columns, e_id and t_id, denoting which exons belong to which transcripts

i2t.ctab: table with two columns, i_id and t_id, denoting which introns belong to which transcripts

kallisto

GCATemplates available: no

kallisto homepage

kallisto is a program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads.

module load kallisto/0.45.0-foss-2018b

Sleuth

GCATemplates available: no

sleuth homepage usage

Sleuth is a program for analysis of RNA-Seq experiments for which transcript abundances have been quantified with kallisto. sleuth provides tools for exploratory data analysis utilizing Shiny by RStudio, and implements statistical algorithms for differential analysis that leverage the boostrap estimates of kallisto.

module load sleuth/0.28.0-iomkl-2015B-R-3.2.5-default-mt