RNA-seq
TUTORIALS
VIDEO "How to analyze RNA-Seq data? Find differentially expressed genes in your research"
tutorials from Griffithlab on RNA-seq analysis workflow
example R script for DESeq2
Transcriptome Assembly
Trinity
module spider Trinity
Trinity homepage
Trinity tutorial videos
Trinity memory requirements summary
Trinity performs de novo or reference guided genome assemblies of transcript sequences from Illumina RNA-Seq data.
R Bioconductor libraries, such as qvalue, required by some Trinity analysis scripts are found in this R_tamu module
module load R_tamu/3.6.0-foss-2018b-recommended-mt
Trinity will create 100,000+ temporary files during an analysis run. A quota of 250,000 files (with about 225,000 available) should be enough for one Trinity run at a time. If you want to run multiple Trinity jobs at once, you can use run the analysis in $TMPDIR and copy the results files to your $SCRATCH working directory.
You need to use a directory after $TMPDIR with 'trinity' in the name which will get automatically created. Example: $TMPDIR/trinity_out
Trinity does have checkpoints. If your job times out and you run the same job script again, it should pick up where it left off unless you are using $TMPDIR.
If you are using Galaxy then checkpoints will not work.
NOTE: Trinity 2.4.0+ utilizes seqtk for read normalization which requires read headers to be in a specific format.
This version of Trinity requires either fastq headers in the new style:
@M01581:927:000000000-ARTAL:1:1101:19874:2078 1:N:0:1
or that the fastq headers end with /1 or /2 such as:
@M01581:927:000000000-ARTAL:1:1101:19874:2078/1
If your data come from SRA, be sure to dump the fastq file like so:
fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files file.sra
Sample Trinity Paired End Assembly job Scripts
Grace: compute node
using all 28 cores for 7 days
#!/bin/bash
#SBATCH --job-name=trinity
#SBATCH --time=7-00:00:00
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --mem=360G
#SBATCH --output=stdout.%x.%j
#SBATCH --error=stderr.%x.%j
module load GCC/9.3.0 OpenMPI/4.0.3 Trinity/2.12.0-Python-3.8.2
Trinity --workdir $TMPDIR/trinity_workdir --max_memory 300G --CPU 48 --inchworm_cpu 6 --no_version_check --seqType fq --left myreads_1.fastq.gz --right myreads_2.fastq.gz
The chrysalis step will generate thousands of small files in the $TMPDIR when specifying --workdir $TMPDIR/trinity_workdir which can help in preventing the job from creating more files than are available in your file quota.
Set the --max_memory value to a few GB below the #SBATCH --mem= to prevent reaching the memory limit with Trinity sub-processes.
The --full_cleanup option will remove all the thousands of intermediate temporary files once Trinity has completed.
If you use the --full_cleanup option (recommended), then the final Trinity.fasta file will be copied to the working directory and the outdir will be deleted.
If you do not see your Trinity.fasta file in the working directory then just submit the job script again and let it finish the final step.
If you do not use the --full_cleanup option, then the final Trinity.fasta file will be in the outdir. If you do not see it there then just submit the job script again and let it finish the final step.
using $TMPDIR
You can use the $TMPDIR to write all files to a local drive on the compute node. All files written to $TMPDIR do not count against your file quota so you don't have to request a file quota increase.
Example commands to use in your job script to run Trinity and copy the result files to your working directory:
Trinity --seqType fq --max_memory 300G --single myseqs.fastq.gz --CPU 48 --no_version_check --inchworm_cpu 6 --output $TMPDIR/trinity_out
cp $TMPDIR/trinity_out/Trinity.fasta.gene_trans_map ./
cp $TMPDIR/trinity_out/Trinity.fasta ./
When Trinity is complete, the two necessary results files are copied from the $TMPDIR to the working directory. Rename the files with a prefix if needed when copying the two result files.
If you use $TMPDIR no checkpoints are available. If the job fails due to running out of walltime or memory, all files are deleted and you have to start from the beginning.
You can't use $TMPDIR if using grid mode with tamulauncher.
Example Trinity Tutorials
If you need more than the 3TB memory Grace compute nodes, you can use external HPCs
Scripture
GCATemplates available: no
Scripture homepage
module spider Scripture
Scripture is a method for transcriptome reconstruction that relies solely on RNA-Seq reads and an assembled genome to build a transcriptome ab initio.
The statistical methods to estimate read coverage significance are also applicable to other sequencing data.
Scripture also has modules for ChIP-Seq peak calling.
StringTie
GCATemplates available: no
StringTie homepage
module spider StringTie
StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts.
To identify differentially expressed genes between experiments, StringTie's output can be processed either by the Cuffdiff or Ballgown programs.
Trans-ABySS
GCATemplates available: no
Trans-ABySS homepage
module spider Trans-ABySS
de novo assembly of RNA-Seq data using ABySS
The current version of the Trans-ABySS package comes with 3 main applications:
1. transabyss - assemble RNAseq data
2. transabyss-merge - merge multiple assemblies from (1)
3. transabyss-analyze - analyze assemblies, either from (1) or (2), for structural variants and splice variants. Requires reference genome and annotations.
SOAPdenovo-Trans
GCATemplates available: no
SOAPdenovo-Trans home
module spider SOAPdenovo-Trans
SOAPdenovo-Trans is a de novo transcriptome assembler basing on the SOAPdenovo framework, adapt to alternative splicing and different expression level among transcripts.The assembler provides a more accurate, complete and faster way to construct the full-length transcript sets.
For animal transcriptomes like mouse, about 30-35GB memory would be required.
Transcriptome Assembly Evaluation
DETONATE
GCATemplates available: no
DETONATE homepage
module spider DETONATE
DETONATE (DE novo TranscriptOme rNa-seq Assembly with or without the Truth Evaluation) consists of two component packages, RSEM-EVAL and REF-EVAL. Both packages are mainly intended to be used to evaluate de novo transcriptome assemblies, although REF-EVAL can be used to compare sets of any kinds of genomic sequences.
BUSCO
BUSCO homepage
version 5.0.x
Version 5.0.0+ now uses Metaeuk as the default gene predictor instead of Augustus so you don't need to rsync the AUGUSTUS directory unless you want to use Augustus.
Databases for v5.0.0+ are found on Grace in the directory:
/scratch/data/bio/busco5/lineages
Contact the HPRC helpdesk if you need additional databases downloaded to the shared busco5 lineages directory.
version 4.0.x
To use BUSCO you need to copy the augustus config files (about 1000 files) to your $SCRATCH directory (only need to do once). Make sure you load the BUSCO module successfully by running 'module list' after the module load command.
module purge
module load BUSCO/4.0.5-foss-2019b-Python-3.7.4
module list
mkdir $SCRATCH/my_augustus_config
rsync -r /sw/eb/software/AUGUSTUS/3.3.3-foss-2019b/ $SCRATCH/my_augustus_config
chmod -R 755 $SCRATCH/my_augustus_config
You need to add the following in your job script:
export AUGUSTUS_CONFIG_PATH="$SCRATCH/my_augustus_config/config"
The lineages for version 4.0.x use odb10. Contact HPRC helpdesk if there is not a lineage for your organism in the odb10 directory.
Grace: /scratch/data/bio/busco5/lineages/
Transcriptome Assembly Annotation
Trinotate
Available GCATemplates
Trinotate is a comprehensive annotation suite designed for automatic functional annotation of transcriptomes, particularly de novo assembled transcriptomes, from model or non-model organisms.
Trinotate home
# Grace
module load GCC/9.3.0 OpenMPI/4.0.3 Trinotate/3.2.2
Trinotate has a pipeline script that can be used to run all annotation steps at once.
Trinotate has checkpoints so if your job is stopped, you can restart the job and it should start from the last checkpoint.
You will need to copy the Trinotate.sqlite database from the following directory to your working directory. You should not have this line in your job script if you are restarting a stopped job. The following UNIX command will copy the necessary files to your current working directory.
Run this command on the UNIX command line prior to running your job:
module load GCC/9.3.0 OpenMPI/4.0.3 Trinotate/3.2.2
cp $EBROOTTRINOTATE/resources/Trinotate.sqlite ./
chmod 755 Trinotate.sqlite
Then in your job script, use the following two lines to create a gene to transcript map and run the pipeline on your Trinity.fasta assembled transcripts file:
$TRINITY_HOME/util/support_scripts/get_Trinity_gene_to_trans_map.pl Trinity.fasta > Trinity.fasta.gene_to_trans_map
$TRINOTATE_HOME/auto/autoTrinotate.pl --transcripts Trinity.fasta --CPU $SLURM_CPUS_PER_TASK \
--Trinotate_sqlite Trinotate.sqlite --gene_to_trans_map Trinity.fasta.gene_to_trans_map
TrinotateWeb
start an interactive job from the Grace command line
srun --time=1-00:00:00 --mem=56G --ntasks=1 --cpus-per-task=8 --pty bash
go to the directory where your Trinotate.sqlite file is located
cd $SCRATCH/mytrinotate/
build a directory from the singularity image file; this will create 22,006 files and directories singularity build --sandbox trinotate_web.dir /sw/hprc/sw/containers/trinotate_web_latest.sif
launch a shell which will take you inside the image to the Singularity> prompt
singularity shell -w -B $PWD:/data trinotate_web.dir
at the Singularity> prompt, run the command: run_TrinotateWebserver.pl 8080
Singularity> run_TrinotateWebserver.pl 8080
open a second terminal session in Grace and check the NODELIST column of the following command
squeue -u $USER
run the following in the terminal on your Mac or Linux desktop computer or your PowerShell if using Windows
use the compute node name (c536 in this example) of your job and keep the numbers 1234 and 8080 the same as below
ssh -N -L 1234:c536:8080 my_NetID@grace.hprc.tamu.edu
Open the following URL on your desktop computer's web browser http://localhost:1234/cgi-bin/index.cgi
enter the following for Path to Trinotate SQLite database:
/data/Trinotate.sqlite
type exit in the Grace terminal to end the interactive job when you are finished browsing the database
exit
hit Ctrl+c keys on your desktop terminal to end the ssh session when you are finished browsing the database
when you are done processing your project, delete the sandbox directory to free up disk space
rm -rf trinotate_web.dir
TransDecoder
GCATemplates available: no
TransDecoder home
module spider TransDecoder
TransDecoder identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity, or constructed based on RNA-Seq alignments to the genome using Tophat and Cufflinks.
miRNA
GCATemplates available: no
miRNA home
module spider miRNA
The BCGSC miRNA Profiling Pipeline produces expression profiles of known miRNAs from BWA-aligned BAM files and generates summary reports and graphs describing the results.
The miRNA application uses a local copy of the UCSC and miRBase databases. The UCSC database is populated only with the organism being analyzed.
First create a directory, go to that directory and run the startup script
You only need to run this once and it is run on the command line and not within a job script. It takes about 10 minutes to populate the UCSC and miRBase database files.
Once you successfully run this setup script, you do not need the script any more.
You can run miRNA from any $SCRATCH directory you just need to rsync the installation files to your working directory
Then you need to create a project directory inside your job script directory and place your .sam and adapter.report file in that directory. In this example the project directory is called project.
Notice that the names of the adapter.report and .sam file must be in this format:
project/my_sorted.sam
project/my_sorted_adapter.report
Currently there is only support for the Dog genome. The UCSC database and species codes are
The commands in your job script will look like this:
# run the mysqld command to start the database
./mysqld start &
sleep 120 # give mysqld_mirna time to initialize the database
annotate.pl -m mirna_21a -u canFam3 -o cfa -p project
# run the mysqld command to stop the database
./mysqld stop
The job script will launch a MySQL database on the compute node and run the analysis and then stop the MySQL database.
Differential Expression
Bowtie & Bowtie2
Bowtie homepage
module spider Bowtie
Genome indexes for bowtie can be found in the ucsc directory for each organism:
/scratch/datasets/genome_indexes/ucsc/organism_name/bowtie_index
Bowtie2 homepage
module spider Bowtie2
Genome indexes for bowtie2 can be found in the ucsc directory for each organism:
/scratch/datasets/genome_indexes/ucsc/organism_name/bowtie2_index
Bowtie indexes will not work with the Bowtie2 aligner.
Bowtie2 indexes will not work with the Bowtie aligner.
Please contact the HPRC helpdesk help@sc.tamu.edu if you need additional genome indexes
Cufflinks
GCATemplates available: no
Cufflinks homepage
module spider Cufflinks
Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression (cuffdiff) and regulation in RNA-Seq samples.
Cufflinks includes a program, “cuffdiff”, that you can use to find significant changes in transcript expression, splicing, and promoter use.
cummeRbund
GCATemplates available: no
CummeRbund is found in the R_tamu module which loads R along with many Bioconductor pacakges
module spider R_tamu
CummeRbund takes the various output files from a cuffdiff run and creates a SQLite database of the results describing appropriate relationships betweeen genes, transcripts, transcription start sites, and CDS regions.
library(cummeRbund)
Salmon
GCATemplates available: no
Salmon manual
Salmon is a wicked-fast program to produce a highly-accurate, transcript-level quantification estimates from RNA-seq data.
module spider Salmon
Sailfish
Sailfish homepage
module spider Sailfish
Sailfish: Rapid Alignment-free Quantification of Isoform Abundance
edgeR
GCATemplates available: no
edgeR homepage
module spider R_tamu/3.2.0-iomkl-2015B-default-mt
Empirical analysis of digital gene expression data in R
edgeR Tutorial
DESeq
GCATemplates available: no
DESeq homepage
DESeq2 homepage
module load R_tamu/3.2.0-intel-2015B-default-mt
Differential gene expression analysis based on the negative binomial distribution
Here are good tutorials
DESeq2
tutorial
DESeq2
tutorial
(use module load R_tamu/3.3.1-intel-2015B-default-mt)
Ballgown
GCATemplates available: no
ballgown homepage
module load R_tamu/3.5.2-intel-2018b-recommended-mt
Tools for statistical analysis of assembled transcriptomes, including flexible differential expression analysis, visualization of transcript structures, and matching of assembled transcripts to annotation.
Ballgown_input_files(generated by StringTie for example): count files (*ctab files) for each sample
- e_data.ctab: exon-level expression measurements.
- i_data.ctab: intron- (i.e., junction-) level expression measurements
- t_data.ctab: transcript-level expression measurements
- e2t.ctab: table with two columns, e_id and t_id, denoting which exons belong to which transcripts
- i2t.ctab: table with two columns, i_id and t_id, denoting which introns belong to which transcripts
kallisto
GCATemplates available: no
kallisto homepage
kallisto is a program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads.
module load kallisto/0.45.0-foss-2018b
Sleuth
GCATemplates available: no
Sleuth is a program for analysis of RNA-Seq experiments for which transcript abundances have been quantified with kallisto. sleuth provides tools for exploratory data analysis utilizing Shiny by RStudio, and implements statistical algorithms for differential analysis that leverage the boostrap estimates of kallisto.
module load sleuth/0.28.0-iomkl-2015B-R-3.2.5-default-mt