Genome Assembly
NOTES
Effects of kmer size example
de novo
ABySS
ABySS homepage
Use the 'module spider' command to see the available versions of ABySS
ABySS is a de novo, parallel, paired-end sequence assembler that is designed for short reads.
The single-processor version is useful for assembling genomes up to 100 Mbases in size.
The parallel version is implemented using MPI and is capable of assembling larger genomes.
The other ABySS 1.9.0 modules are configured with a maxk of 128.
SPAdes
SPAdes homepage
SPAdes was initially designed for small genomes.
The maximum k value is 128.
Add the following to your Grace job scripts when using all 48 cores of the 384GB compute nodes since SPAdes supports OpenMP (use 80 for the bigmem nodes):
export OMP_NUM_THREADS=48
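If a job does not always request a full node, the thread count can follow the Slurm allocation instead of being hard-coded. The following is a minimal sketch, assuming the job was submitted with --cpus-per-task so that Slurm sets SLURM_CPUS_PER_TASK; the helper name pick_threads is hypothetical and not part of SPAdes or the cluster software.

```shell
# Hypothetical helper: print the thread count for OpenMP programs.
# Uses the Slurm allocation when it is set, otherwise the given default
# (48 matches the 384GB Grace nodes mentioned above).
pick_threads() {
    default_threads="$1"
    echo "${SLURM_CPUS_PER_TASK:-$default_threads}"
}

export OMP_NUM_THREADS="$(pick_threads 48)"
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```

Outside a Slurm job the fallback value is used, so the same lines work for quick tests on a login node.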
The current version of SPAdes works with Illumina or IonTorrent reads and is capable of providing hybrid assemblies using PacBio, Oxford Nanopore and Sanger reads.
You can also provide additional contigs that will be used as long reads.
Unicycler
GCATemplates available: no
Unicycler is an assembly pipeline for bacterial genomes.
It circularises replicons without the need for a separate tool like Circlator.
It can assemble Illumina-only read sets, where it functions as a SPAdes optimiser. It can also assemble long-read-only sets (PacBio or Nanopore), where it runs a miniasm+Racon pipeline. For the best possible assemblies, give it both Illumina reads and long reads, and it will conduct a hybrid assembly.
module spider Unicycler
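The three input modes described above differ only in which read files are passed to Unicycler. The sketch below prints (but does not run) a command line for each case, using Unicycler's -1/-2 options for paired short reads, -l for long reads and -o for the output directory. The helper name unicycler_cmd and all file names are hypothetical placeholders.

```shell
# Hypothetical helper: print (not run) a Unicycler command for the reads
# you have. Pass "" for a read set you do not have.
unicycler_cmd() {
    short1="$1"; short2="$2"; long_reads="$3"; out_dir="$4"
    cmd="unicycler"
    if [ -n "$short1" ]; then
        cmd="$cmd -1 $short1 -2 $short2"    # Illumina paired-end reads
    fi
    if [ -n "$long_reads" ]; then
        cmd="$cmd -l $long_reads"           # PacBio or Nanopore reads
    fi
    echo "$cmd -o $out_dir"
}

# Hybrid assembly: both short and long reads are supplied
unicycler_cmd short_1.fastq.gz short_2.fastq.gz long.fastq.gz hybrid_out
```

Printing the command first makes it easy to check the mode before pasting the line into a job script.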
MaSuRCA
GCATemplates available: no
MaSuRCA homepage
MaSuRCA is whole genome assembly software.
MaSuRCA can assemble data sets containing only short reads from Illumina sequencing or a mixture of short reads and long reads (Sanger, 454).
MaSuRCA version 3.2.1+ can utilize PacBio reads in the assembly.
IMPORTANT! Do not pre‐process Illumina data before providing it to MaSuRCA. Do not do any trimming, cleaning or error correction. This WILL deteriorate the assembly.
Velvet
GCATemplates available: no
Velvet homepage
Velvet is a sequence assembler for very short reads.
To see the configured max kmer length for a particular velvet module, run the following command and look at the MAXKMERLENGTH output:
velveth -h
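To pick out just the value rather than reading the whole help text, the output can be filtered. This is a minimal sketch, assuming the help text contains a line of the form "MAXKMERLENGTH = <number>"; the helper name max_kmer_length is hypothetical, and the sample text piped in below is illustrative rather than real output from a cluster module.

```shell
# Hypothetical helper: read velveth help text on stdin and print the
# compiled-in maximum k-mer length (first match only).
max_kmer_length() {
    sed -n 's/.*MAXKMERLENGTH *= *\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Illustrative sample text; on the cluster, pipe the real output instead:
#   velveth -h 2>&1 | max_kmer_length
printf 'CATEGORIES = 2\nMAXKMERLENGTH = 31\n' | max_kmer_length
```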
SGA
GCATemplates available: no
SGA homepage
SGA is a de novo assembler designed to assemble large genomes from high coverage short read data.
SGA implements a set of assembly algorithms based on the FM-index.
As the FM-index is a compressed data structure, the algorithms are very memory efficient.
ALLPATHS-LG
GCATemplates available: no
ALLPATHS-LG is a short read assembler and it works on both small and large (mammalian size) genomes.
To use it, you should first generate ~100 base Illumina reads from two libraries: one from ~180 bp fragments, and one from ~3000 bp fragments, both at about 45x coverage.
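The coverage target translates into a read count through the usual relation N = C × G / L, where C is the coverage, G the genome size in bases and L the read length in bases. The sketch below works this out for one library; the helper name reads_for_coverage and the 3 Gbase genome size are assumptions chosen for illustration.

```shell
# Reads needed for a target coverage: N = C * G / L
#   C = coverage, G = genome size in bases, L = read length in bases
reads_for_coverage() {
    coverage="$1"; genome_size="$2"; read_length="$3"
    echo $(( coverage * genome_size / read_length ))
}

# 45x coverage of a hypothetical 3 Gbase genome with 100-base reads
reads_for_coverage 45 3000000000 100
```

The result counts individual reads, so a paired-end library needs half that many read pairs, and the same calculation applies to each of the two libraries.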
SOAPdenovo & SOAPdenovo2
GCATemplates available: no
SOAPdenovo homepage
SOAPdenovo2 homepage
Scaffolding
SSPACE
GCATemplates available: no
SSPACE homepage
SSPACE standard is a stand-alone program for scaffolding pre-assembled contigs using NGS paired-read data.
It is unique in offering the possibility to manually control the scaffolding process.
Opera
GCATemplates available: no
Opera homepage
Opera uses information from paired-end/mate-pair reads to order and orient the intermediate contigs/scaffolds assembled in a genome assembly.
BESST
GCATemplates available: no
BESST homepage
BESST is a package for scaffolding genomic assemblies.
It contains several modules for e.g. building a "contig graph" from available information, obtaining scaffolds from this graph, and accurate gap size information.
L_RNA_scaffolder
GCATemplates available: no
L_RNA_scaffolder homepage
L_RNA_scaffolder is a genome scaffolding tool that uses long transcriptome reads.
The long transcriptome reads can be generated by 454/Sanger/Ion_Torrent sequencing, or de novo assembled from paired-end Illumina sequencing.
Gap Filling
GapFiller
GCATemplates available: no
GapFiller is a stand-alone program for closing gaps within pre-assembled scaffolds.
Merge Assemblies
Metassembler
GCATemplates available: no
Metassembler is a software package for reconciling assemblies produced by de novo short-read assemblers such as SOAPdenovo and ALLPATHS-LG.
The goal of assembly reconciliation, or "metassembly," is to combine multiple assemblies into a single genome that is superior to all of its constituents.
Improve Assemblies
Pilon
GCATemplates available: no
Pilon homepage
Pilon is a software tool that can be used, if you have both Illumina and PacBio reads, to:
- Automatically improve draft assemblies
- Find variation among strains, including large event detection
Redundans
GCATemplates available: no
Redundans homepage
The Redundans pipeline assists the assembly of heterozygous genomes. The program takes assembled contigs, sequencing libraries and/or a reference sequence as input and returns a scaffolded homozygous genome assembly. The final assembly should be less fragmented and have a smaller total size than the input contigs. In addition, Redundans will automatically close the gaps resulting from genome assembly or scaffolding.
Gene Modeling
AUGUSTUS
GCATemplates available: no
Augustus homepage
AUGUSTUS is an open source program that predicts genes in eukaryotic genomic sequences.
A list of available augustus species can be found here:
/sw/eb/sw/AUGUSTUS/3.4.0-foss-2020b/config/species/
You will need to copy the Augustus config directory to one of your $SCRATCH directories; $SCRATCH/my_augustus_config_3.4.0 is a good place. Make sure you load the AUGUSTUS module first!
module load GCC/11.2.0 OpenMPI/4.1.1 AUGUSTUS/3.4.0
mkdir $SCRATCH/my_augustus_config_3.4.0
cp -r /sw/eb/sw/AUGUSTUS/3.4.0-foss-2020b/config/* $SCRATCH/my_augustus_config_3.4.0/
chmod --recursive u+w $SCRATCH/my_augustus_config_3.4.0/
You will also need to add the following in your job script after the module load line:
export AUGUSTUS_CONFIG_PATH="$SCRATCH/my_augustus_config_3.4.0"
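A job can fail late if the config copy is incomplete or read-only, so it may help to check the directory before the main command runs. The following is a minimal sketch; the helper name check_augustus_config is hypothetical, and the only assumption about the layout is that the config directory contains a species/ subdirectory, as in the listing above.

```shell
# Hypothetical helper: succeed only if the directory looks like a usable,
# writable AUGUSTUS config copy (it must contain a species/ subdirectory).
check_augustus_config() {
    config_dir="$1"
    if [ ! -d "$config_dir/species" ]; then
        echo "missing species directory in: $config_dir" >&2
        return 1
    fi
    if [ ! -w "$config_dir/species" ]; then
        echo "species directory is not writable: $config_dir/species" >&2
        return 1
    fi
    echo "AUGUSTUS config looks usable: $config_dir"
}

# In a job script, after the export line:
#   check_augustus_config "$AUGUSTUS_CONFIG_PATH" || exit 1
```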
GeneMark-ES
GCATemplates available: no
GeneMark-ES is for gene prediction in eukaryotic genomes.
To use GeneMark-ES you need to download the GeneMark-ES licence key file.
http://topaz.gatech.edu/GeneMark/license_download.cgi
Select the following: GeneMark-ES and LINUX 64.
You do not need to download the program, just the 64_bit key file.
Save the gm_key_64.gz to your $HOME directory.
Then gunzip the key file and rename it from gm_key_64 to .gm_key
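The last two steps can be done in one go from the command line. This is a minimal sketch, assuming the download was saved as gm_key_64.gz; the helper name install_gm_key is hypothetical, and the directory argument exists only so the helper can be tried somewhere other than $HOME.

```shell
# Hypothetical helper: unpack gm_key_64.gz found in a directory and
# install it as .gm_key in that same directory (normally $HOME).
install_gm_key() {
    key_dir="${1:-$HOME}"
    gunzip -f "$key_dir/gm_key_64.gz" || return 1    # leaves gm_key_64
    mv "$key_dir/gm_key_64" "$key_dir/.gm_key"       # rename to hidden file
}

# Typical use after saving the download to $HOME:
#   install_gm_key
#   ls -l "$HOME/.gm_key"
```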
GeneMarkS
GCATemplates available: no
GeneMarkS is for gene prediction in prokaryotes, intron-less eukaryotes, eukaryotic viruses, phages and EST/cDNA sequences.
To use GeneMarkS you need to download the GeneMarkS licence key file.
http://topaz.gatech.edu/GeneMark/license_download.cgi
Select the following: GeneMark-ES/ET and LINUX 64.
You do not need to download the program, just the 64_bit key file.
Save the gm_key_64.gz to your $HOME directory.
Then gunzip the key file and rename it from gm_key_64 to .gm_key
geneid
GCATemplates available: no
geneid homepage
geneid is a program to predict genes in anonymous genomic sequences designed with a hierarchical structure.
RNAmmer
GCATemplates available: no
RNAmmer homepage
RNAmmer predicts ribosomal RNA genes in full genome sequences by utilising two levels of Hidden Markov Models. An initial spotter model, constructed from highly conserved loci within a structural alignment of known rRNA sequences, searches both strands. Once the spotter model detects the approximate position of a gene, the flanking regions are extracted and passed to the full model, which matches the entire gene. This two-level approach avoids running the full model across the entire genome sequence, which allows faster predictions.
SNAP-HMM
GCATemplates available: no
SNAP-HMM (Semi-HMM-based Nucleic Acid Parser) is a general purpose gene finding program suitable for both eukaryotic and prokaryotic genomes.
Genome Annotation
MAKER
MAKER homepage
MAKER is a portable and easily configurable genome annotation pipeline.
Its purpose is to allow smaller eukaryotic and prokaryotic genome projects to independently annotate their genomes and to create genome databases.
Here is a good paper to help you get started.
You need to complete the following four steps prior to submitting a MAKER job script on an HPRC cluster:
1. Download the GeneMark license key
- To use MAKER you need to download the GeneMark-ES licence key file since GeneMark-ES is part of the MAKER pipeline.
- Download here: http://topaz.gatech.edu/GeneMark/license_download.cgi
- Select the following: GeneMark-ES/ET and LINUX 64.
- You do not need to download the program, just the 64_bit key file.
- Save the gm_key_64.gz to your $HOME directory.
- Then gunzip the key file and rename it from gm_key_64 to .gm_key
2. Create or copy the three required control files; you must edit maker_opts.ctl
- Use the following command to create the three maker control files in your current working directory:
maker -CTL
2a. maker_opts.ctl
- You need to edit the maker_opts.ctl file based on your project. Edit the maker_opts.ctl file to set cpus=20 when using #SBATCH --cpus-per-task=28, or specify cpus as a command option:
maker -cpus 20
- Recommended: Set TMP= to $TMPDIR by using the following maker option:
maker -TMP $TMPDIR
2b. maker_bopts.ctl
- You do not need to edit the maker_bopts.ctl file unless you want to adjust BLAST parameters.
2c. maker_exe.ctl
- You do not need to edit the maker_exe.ctl file since it is pre-configured with executable paths.
3. Create a GeneMark HMM file
- A GeneMark HMM file is needed if you want fasta sequences of predicted genes.
module load GeneMarkS/4.32
gmsn.pl -euk your_genome.fasta
gm -m GeneMark.mat -R -lo -op your_genome.fasta
- Both GeneMark-ES and GeneMarkS are installed, and either can generate the GeneMark HMM file.
- Once your GeneMark_hmm.mod file is generated, add it to the gmhmm value in the maker_opts.ctl file.
4. Add an AUGUSTUS species in your maker_opts.ctl file at the line: augustus_species=
- You can find a list of AUGUSTUS species by loading the Maker module then looking in the directory:
ls $EBROOTAUGUSTUS/config/species
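Steps 2 to 4 above all come down to setting a value in maker_opts.ctl, which can be scripted instead of edited by hand. The following is a minimal sketch, assuming control-file lines have the form "key=value #comment" and that the value contains no "|" or "&" characters; the helper name set_ctl_option is hypothetical and is not a MAKER command, and the species name in the usage lines is only a placeholder.

```shell
# Hypothetical helper: set "key=value" in a MAKER control file, keeping
# the trailing "#comment" on the line.
# Assumes lines look like "key=value #comment".
set_ctl_option() {
    ctl_file="$1"; key="$2"; value="$3"
    tmp_file="${ctl_file}.tmp"
    # Replace everything between "key=" and the comment marker
    sed "s|^${key}=[^#]*|${key}=${value} |" "$ctl_file" > "$tmp_file" &&
        mv "$tmp_file" "$ctl_file"
}

# Example (file names and species are placeholders):
#   set_ctl_option maker_opts.ctl cpus 20
#   set_ctl_option maker_opts.ctl gmhmm "$PWD/GeneMark_hmm.mod"
#   set_ctl_option maker_opts.ctl augustus_species fly
```

Checking the result with grep (for example grep '^augustus_species=' maker_opts.ctl) before submitting the job confirms the edit took effect.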
MAKER 2.31.10
Maker version 2.31.10 requires that you run two scripts after the maker command is complete.
cd dpp_contig.maker.output
fasta_merge -d dpp_contig_master_datastore_index.log
gff3_merge -d dpp_contig_master_datastore_index.log
Maker version 2.31.10 -help information:
MAKER version 2.31.10
Usage:
maker [options] <maker_opts> <maker_bopts> <maker_exe>
Description:
MAKER is a program that produces gene annotations in GFF3 format using
evidence such as EST alignments and protein homology. MAKER can be used to
produce gene annotations for new genomes as well as update annotations
from existing genome databases.
The three input arguments are control files that specify how MAKER should
behave. All options for MAKER should be set in the control files, but a
few can also be set on the command line. Command line options provide a
convenient machanism to override commonly altered control file values.
MAKER will automatically search for the control files in the current
working directory if they are not specified on the command line.
Input files listed in the control options files must be in fasta format
unless otherwise specified. Please see MAKER documentation to learn more
about control file configuration. MAKER will automatically try and
locate the user control files in the current working directory if these
arguments are not supplied when initializing MAKER.
It is important to note that MAKER does not try and recalculated data that
it has already calculated. For example, if you run an analysis twice on
the same dataset you will notice that MAKER does not rerun any of the
BLAST analyses, but instead uses the blast analyses stored from the
previous run. To force MAKER to rerun all analyses, use the -f flag.
MAKER also supports parallelization via MPI on computer clusters. Just
launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
configured during the MAKER installation process for this to work though
Options:
-genome|g <file>    Overrides the genome file path in the control files
-RM_off|R           Turns all repeat masking options off.
-datastore/         Forcably turn on/off MAKER's two deep directory
 nodatastore        structure for output. Always on by default.
-old_struct         Use the old directory styles (MAKER 2.26 and lower)
-base <string>      Set the base name MAKER uses to save output files.
                    MAKER uses the input genome file name by default.
-tries|t <integer>  Run contigs up to the specified number of tries.
-cpus|c <integer>   Tells how many cpus to use for BLAST analysis.
                    Note: this is for BLAST and not for MPI!
-force|f            Forces MAKER to delete old files before running again.
                    This will require all blast analyses to be rerun.
-again|a            recaculate all annotations and output files even if no
                    settings have changed. Does not delete old analyses.
-quiet|q            Regular quiet. Only a handlful of status messages.
-qq                 Even more quiet. There are no status messages.
-dsindex            Quickly generate datastore index file. Note that this
                    will not check if run settings have changed on contigs
-nolock             Turn off file locks. May be usful on some file systems,
                    but can cause race conditions if running in parallel.
-TMP                Specify temporary directory to use.
-CTL                Generate empty control files in the current directory.
-OPTS               Generates just the maker_opts.ctl file.
-BOPTS              Generates just the maker_bopts.ctl file.
-EXE                Generates just the maker_exe.ctl file.
-MWAS <option>      Easy way to control mwas_server for web-based GUI
                    options: STOP
                             START
                             RESTART
-version            Prints the MAKER version.
-help|?             Prints this usage statement.
Funannotate
GCATemplates available: no
Funannotate homepage
Funannotate is a genome prediction, annotation, and comparison software package.
Prior to running funannotate in a job script, you will need to do the following:
- download the GeneMark license key as described in the GeneMark-ES section.
- load the funannotate module and rsync the AUGUSTUS config to your $SCRATCH
Grace Example to run on login node command line prior to submitting your job script:
module load GCC/9.3.0 OpenMPI/4.0.3 funannotate/1.8.15-Python-3.8.2
mkdir $SCRATCH/my_augustus_config_3.4.0
rsync -r /sw/eb/sw/AUGUSTUS/3.4.0-gompi-2020a-Python-3.8.2/ $SCRATCH/my_augustus_config_3.4.0
- set the AUGUSTUS_CONFIG_PATH variable in your job script before the line that runs the funannotate command
export AUGUSTUS_CONFIG_PATH=$SCRATCH/my_augustus_config_3.4.0/config
BRAKER1
GCATemplates available: no
BRAKER1 homepage
BRAKER1: Unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS
Since AUGUSTUS is used, you will need to rsync the Augustus config to one of your directories; $SCRATCH/my_augustus_config_3.4.0 is a good place.
module load GCC/10.2.0 OpenMPI/4.0.5 AUGUSTUS/3.4.0
mkdir $SCRATCH/my_augustus_config_3.4.0
rsync -r /sw/eb/sw/AUGUSTUS/3.4.0-foss-2020b/ $SCRATCH/my_augustus_config_3.4.0
You will also need to add the following in your job script:
export AUGUSTUS_CONFIG_PATH="$SCRATCH/my_augustus_config_3.4.0/config"
RepeatMasker
RepeatMasker homepage
RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.
Currently, RMBlast is the only configured sequence search engine, and it is also the default.
The RepBase database now charges a subscription fee for the use of its repeat databases. Neither HPRC nor TAMU has a RepBase subscription. RepeatMasker is available on HPRC clusters with the default repeat databases provided by RepeatMasker, not the RepBase databases.
TRF
GCATemplates available: no
TRF homepage
trf (Tandem Repeats Finder) is a program to locate and display tandem repeats in DNA sequences.
RepeatScout
RepeatScout homepage
RepeatScout is a tool to discover repetitive substrings in DNA. The purpose of the RepeatScout software is to identify repeat family sequences from genomes where hand-curated repeat databases (a la RepBase update) are not available. In fact, the output of this program can be used as input to RepeatMasker as a way of automatically masking newly-sequenced genomes.
PASA
GCATemplates available: no
PASA homepage
PASA, acronym for Program to Assemble Spliced Alignments, is a eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to automatically model gene structures, and to maintain gene structure annotation consistent with the most recently available experimental sequence data. PASA also identifies and classifies all splicing variations supported by the transcript alignments.
You can copy the PASA config files to your working directory and change the permissions. They are found here:
$PASAHOME/pasa_conf/
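The copy and permission change can be done together. This is a minimal sketch, assuming a loaded PASA module sets PASAHOME as shown above; the helper name copy_pasa_conf and the project directory in the usage line are hypothetical.

```shell
# Hypothetical helper: copy the PASA config templates into a working
# directory and make the copies writable by you.
copy_pasa_conf() {
    dest_dir="${1:-.}"
    cp -r "$PASAHOME/pasa_conf" "$dest_dir/" || return 1
    chmod -R u+w "$dest_dir/pasa_conf"
}

# Typical use, after loading a PASA module:
#   copy_pasa_conf "$SCRATCH/my_pasa_project"
```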
Genome Completeness
BUSCO
BUSCO homepage
version 5.0.x
Version 5.0.0+ now uses Metaeuk as the default gene predictor instead of Augustus so you don't need to rsync the AUGUSTUS directory unless you want to use Augustus.
Databases for v5.0.0+ are found on Grace in the directory:
/scratch/data/bio/busco5/lineages
Contact the HPRC helpdesk if you need additional databases downloaded to the shared busco5 lineages directory.
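Before contacting the helpdesk, it is quick to check whether a lineage is already present in the shared directory. The following is a minimal sketch; the helper name find_lineages is hypothetical, and the search term in the usage line is only an example.

```shell
# Hypothetical helper: list lineage datasets whose names contain a search
# term (case-insensitive). Defaults to the shared Grace directory named
# above; a different directory can be given as the second argument.
find_lineages() {
    term="$1"
    lineage_dir="${2:-/scratch/data/bio/busco5/lineages}"
    ls -1 "$lineage_dir" | grep -i -- "$term"
}

# Typical use on Grace:
#   find_lineages bacteria
```

No output (and a non-zero exit status) means no matching lineage was found in that directory.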
version 4.0.x
To use BUSCO you need to copy the augustus config files (about 1000 files) to your $SCRATCH directory (you only need to do this once). Make sure the BUSCO module loaded successfully by running 'module list' after the module load command.
module purge
module load BUSCO/4.0.5-foss-2019b-Python-3.7.4
module list
mkdir $SCRATCH/my_augustus_config_3.3.3
rsync -r /sw/eb/software/AUGUSTUS/3.3.3-foss-2019b/ $SCRATCH/my_augustus_config_3.3.3
chmod -R 755 $SCRATCH/my_augustus_config_3.3.3
You need to add the following in your job script:
export AUGUSTUS_CONFIG_PATH="$SCRATCH/my_augustus_config_3.3.3/config"
The lineages for version 4.0.x use odb10. Contact HPRC helpdesk if there is not a lineage for your organism in the odb10 directory.
Grace: /scratch/data/bio/busco4/lineages/
version 3.0.2b
To use BUSCO you need to copy the augustus config files (about 1000 files) to your $SCRATCH directory (you only need to do this once). Make sure the BUSCO module loaded successfully by running 'module list' after the module load command.
module purge
module load GCC/10.2.0 OpenMPI/4.0.5 BUSCO/5.1.2
module list
mkdir $SCRATCH/my_augustus_config_3.4.0
rsync -r /sw/eb/sw/AUGUSTUS/3.4.0-foss-2020b/ $SCRATCH/my_augustus_config_3.4.0
chmod -R 755 $SCRATCH/my_augustus_config_3.4.0
You need to add the following in your job script:
export AUGUSTUS_CONFIG_PATH="$SCRATCH/my_augustus_config_3.4.0/config"
A list of available augustus species can be found here:
/sw/eb/sw/AUGUSTUS/3.4.0-gompi-2020a-Python-3.8.2/config/species/
A list of available busco lineages for BUSCO version 5 can be found here:
/scratch/data/bio/busco5/lineages/
CEGMA
GCATemplates available: no
CEGMA (Core Eukaryotic Genes Mapping Approach) is used for building a highly reliable set of gene annotations in the absence of experimental data.
QUAST
GCATemplates available: no
QUAST homepage
QUAST evaluates genome assemblies.
REAPR
GCATemplates available: no
REAPR homepage
REAPR is a tool that evaluates the accuracy of a genome assembly using mapped paired end reads, without the use of a reference genome for comparison.
Bandage
GCATemplates available: no
Bandage homepage
Bandage is a GUI program that allows users to interact with the assembly graphs made by de novo assemblers such as Velvet, SPAdes, MEGAHIT and others.
Bandage is a GUI application and can run faster on the HPRC portal (portal.hprc.tamu.edu).
Start a VNC job on the portal and type the following on the command prompt once your portal job is launched: