Genome Assembly

NOTES

Effects of kmer size example

de novo

ABySS

Use the 'module spider' command to see the available versions of ABySS

module spider abyss

ABySS is a de novo, parallel, paired-end sequence assembler that is designed for short reads.

The single-processor version is useful for assembling genomes up to 100 Mbases in size.

The parallel version is implemented using MPI and is capable of assembling larger genomes.

The other ABySS 1.9.0 modules are configured with a maxk of 128.
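Because the modules cap k at maxk, one common way to choose k is to run several assemblies and compare them. The sketch below only prints the abyss-pe commands it would run, one directory per k. The helper name abyss_k_sweep, the assembly name asm and the read file names are placeholders, not part of ABySS; the maxk value of 128 comes from the note above.

```shell
#!/bin/bash
# Print (not run) one abyss-pe command per k value, skipping any k above maxk.
# abyss_k_sweep is a hypothetical helper; reads_1.fq/reads_2.fq are placeholders.
abyss_k_sweep() {
    local maxk="$1"; shift
    local k
    for k in "$@"; do
        if [ "$k" -gt "$maxk" ]; then
            echo "# skipping k=$k: larger than maxk=$maxk" >&2
            continue
        fi
        # -C makes abyss-pe work inside k$k, so input paths are relative to it
        echo "mkdir -p k$k && abyss-pe -C k$k name=asm k=$k in='../reads_1.fq ../reads_2.fq'"
    done
}

# maxk first, then the k values to try; k=160 is skipped
abyss_k_sweep 128 64 96 128 160
```

Review the printed commands, then paste them into a job script after the module load line for the ABySS version you chose.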

SPAdes

GCATemplates

Grace (pe)

SPAdes homepage

module spider SPAdes

SPAdes was initially designed for small genomes.

The maximum k value is 128.

SPAdes supports OpenMP, so add the following to your Grace job script when using all 48 cores of a 384GB compute node (use 80 for the bigmem nodes):

export OMP_NUM_THREADS=48
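A minimal sketch of where that export might sit in a Slurm job script. The #SBATCH values, read file names and output directory are placeholders rather than site recommendations; -1, -2, -t, -m and -o are standard spades.py options. The snippet writes the script to spades_job.sh so it can be reviewed before it is submitted.

```shell
# Write a hypothetical Slurm job script for SPAdes to spades_job.sh.
# The quoted 'EOF' keeps $variables unexpanded until the job runs.
cat > spades_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=spades
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --mem=360G
#SBATCH --time=1-00:00:00

module purge
# load the SPAdes version reported by 'module spider SPAdes' here

export OMP_NUM_THREADS=48

# paired-end Illumina reads (placeholder file names); -m is the memory cap in GB
spades.py -1 reads_1.fastq.gz -2 reads_2.fastq.gz \
    -t "$OMP_NUM_THREADS" -m 360 -o spades_out
EOF
```

After adding the module load line, the script would be submitted with sbatch spades_job.sh.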

The current version of SPAdes works with Illumina or IonTorrent reads and is capable of providing hybrid assemblies using PacBio, Oxford Nanopore and Sanger reads.

You can also provide additional contigs that will be used as long reads.

Unicycler

GCATemplates available: no

Unicycler is an assembly pipeline for bacterial genomes.

It circularises replicons without the need for a separate tool like Circlator.

It can assemble Illumina-only read sets, where it functions as a SPAdes optimiser. It can also assemble long-read-only sets (PacBio or Nanopore), where it runs a miniasm+Racon pipeline. For the best possible assemblies, give it both Illumina reads and long reads, and it will conduct a hybrid assembly.

module spider Unicycler
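The three input modes described above differ only in which read options are passed. The sketch below prints the command line for each mode; unicycler_cmd is a hypothetical helper and the file and directory names are placeholders, while -1, -2, -l and -o are Unicycler's options for short read pairs, long reads and the output directory.

```shell
# Print the unicycler command for the reads that are available.
# unicycler_cmd is a hypothetical helper; pass "" for inputs you do not have.
unicycler_cmd() {
    local r1="$1" r2="$2" long="$3" out="$4"
    local cmd="unicycler"
    if [ -n "$r1" ] && [ -n "$r2" ]; then
        cmd="$cmd -1 $r1 -2 $r2"      # Illumina read pairs
    fi
    if [ -n "$long" ]; then
        cmd="$cmd -l $long"           # PacBio or Nanopore reads
    fi
    echo "$cmd -o $out"
}

unicycler_cmd short_1.fastq.gz short_2.fastq.gz ""            illumina_only
unicycler_cmd ""               ""               long.fastq.gz long_only
unicycler_cmd short_1.fastq.gz short_2.fastq.gz long.fastq.gz hybrid
```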

MaSuRCA

GCATemplates available: no

MaSuRCA homepage

module spider MaSuRCA

MaSuRCA is whole genome assembly software.

MaSuRCA can assemble data sets containing only short reads from Illumina sequencing or a mixture of short reads and long reads (Sanger, 454).

MaSuRCA version 3.2.1+ can utilize PacBio reads in the assembly.

IMPORTANT! Do not pre-process Illumina data before providing it to MaSuRCA. Do not do any trimming, cleaning or error correction. This WILL deteriorate the assembly.

Velvet

GCATemplates available: no

Velvet homepage

module spider Velvet

Velvet is a sequence assembler for very short reads.

To see the configured max kmer length for a particular velvet module, run the following command and look at the MAXKMERLENGTH output:

velveth -h
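To pull just the number out of that help text, a small filter can be used. This is a sketch: max_kmer_from_help is a hypothetical helper, and the "MAXKMERLENGTH = N" layout it expects is an assumption about how velveth prints its compilation settings. The last line feeds it sample text so the snippet runs without Velvet loaded.

```shell
# Extract the MAXKMERLENGTH value from velveth help text read on stdin.
# max_kmer_from_help is a hypothetical helper; the "MAXKMERLENGTH = N"
# layout is an assumption about the help output format.
max_kmer_from_help() {
    sed -n 's/.*MAXKMERLENGTH[^0-9]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# On the cluster, after loading a Velvet module:
#   velveth -h 2>&1 | max_kmer_from_help

# Demonstration with sample text; prints 31
printf 'Compilation settings:\nCATEGORIES = 2\nMAXKMERLENGTH = 31\n' | max_kmer_from_help
```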

SGA

GCATemplates available: no

SGA homepage

module spider SGA

SGA is a de novo assembler designed to assemble large genomes from high coverage short read data.

SGA implements a set of assembly algorithms based on the FM-index.

As the FM-index is a compressed data structure, the algorithms are very memory efficient.

ALLPATHS-LG

GCATemplates available: no

module spider ALLPATHS-LG

ALLPATHS-LG is a short read assembler and it works on both small and large (mammalian size) genomes.

To use it, you should first generate ~100 base Illumina reads from two libraries: one from ~180 bp fragments, and one from ~3000 bp fragments, both at about 45x coverage.
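The coverage target translates into a read count through reads = coverage x genome size / read length. A minimal sketch of that arithmetic, where reads_for_coverage is a hypothetical helper and the 100 Mbase genome is only an example:

```shell
# Reads needed for a target coverage: ceil(coverage * genome_bp / read_len).
# reads_for_coverage is a hypothetical helper using bash integer arithmetic.
reads_for_coverage() {
    local genome_bp="$1" read_len="$2" coverage="$3"
    echo $(( (coverage * genome_bp + read_len - 1) / read_len ))
}

# Example: a 100 Mbase genome, 100-base reads, 45x coverage for one library
reads_for_coverage 100000000 100 45    # prints 45000000 (22.5 million pairs)
```

The same count applies to each of the two libraries, since both are sequenced to about 45x.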

SOAPdenovo & SOAPdenovo2

GCATemplates available: no

SOAPdenovo homepage

SOAPdenovo2 homepage

module spider SOAPdenovo

or

module spider SOAPdenovo2

Scaffolding

SSPACE

GCATemplates available: no

SSPACE homepage

module spider SSPACE

SSPACE standard is a stand-alone program for scaffolding pre-assembled contigs using NGS paired-read data.

It is unique in offering the possibility to manually control the scaffolding process.

Opera

GCATemplates available: no

Opera homepage

module spider Opera

Opera uses information from paired-end/mate-pair reads to order and orient the intermediate contigs/scaffolds assembled in a genome assembly.

BESST

GCATemplates available: no

BESST homepage

module spider BESST

BESST is a package for scaffolding genomic assemblies.

It contains several modules for tasks such as building a "contig graph" from available information, obtaining scaffolds from this graph, and obtaining accurate gap size information.

L_RNA_scaffolder

GCATemplates available: no

L_RNA_scaffolder homepage

module spider L_RNA_scaffolder

L_RNA_scaffolder is a genome scaffolding tool that uses long transcriptome reads.

The long transcriptome reads can be generated by 454/Sanger/Ion_Torrent sequencing, or assembled de novo from paired-end Illumina sequencing.

Gap Filling

GapFiller

GCATemplates available: no

module spider GapFiller

GapFiller is a stand-alone program for closing gaps within pre-assembled scaffolds.

Merge Assemblies

Metassembler

GCATemplates available: no

module spider Metassembler

Metassembler is a software package for reconciling assemblies produced by de novo short-read assemblers such as SOAPdenovo and ALLPATHS-LG.

The goal of assembly reconciliation, or "metassembly," is to combine multiple assemblies into a single genome that is superior to all of its constituents.

Improve Assemblies

Pilon

GCATemplates available: no

Pilon homepage

module spider Pilon

If you have Illumina and PacBio reads, Pilon is a software tool that can be used to:

Automatically improve draft assemblies
Find variation among strains, including large event detection

Redundans

GCATemplates available: no

Redundans homepage

The Redundans pipeline assists the assembly of heterozygous genomes. The program takes as input assembled contigs, sequencing libraries and/or a reference sequence and returns a scaffolded homozygous genome assembly. The final assembly should be less fragmented and have a smaller total size than the input contigs. In addition, Redundans will automatically close the gaps resulting from genome assembly or scaffolding.

module spider Redundans

Gene Modeling

AUGUSTUS

GCATemplates available: no

Augustus homepage

module spider AUGUSTUS

AUGUSTUS is a program that predicts genes in eukaryotic genomic sequences.

A list of available augustus species can be found here:

/sw/eb/sw/AUGUSTUS/3.4.0-foss-2020b/config/species/

You will need to copy the Augustus config directory to one of your $SCRATCH directories; $SCRATCH/my_augustus_config_3.4.0 is a good place. Make sure you load the AUGUSTUS module first!

module load GCC/11.2.0  OpenMPI/4.1.1  AUGUSTUS/3.4.0
mkdir $SCRATCH/my_augustus_config_3.4.0
cp -r /sw/eb/sw/AUGUSTUS/3.4.0-foss-2020b/config/* $SCRATCH/my_augustus_config_3.4.0/
chmod --recursive u+w $SCRATCH/my_augustus_config_3.4.0/

You will also need to add the following in your job script after the module load line:

export AUGUSTUS_CONFIG_PATH="$SCRATCH/my_augustus_config_3.4.0"
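A common failure is pointing AUGUSTUS_CONFIG_PATH at the wrong level of the copied tree or at a read-only copy. The sketch below checks for a writable species/ subdirectory, matching the config/species/ layout shown above. check_augustus_config is a hypothetical helper, not part of AUGUSTUS.

```shell
# Return success only if the given AUGUSTUS config directory has a
# writable species/ subdirectory. check_augustus_config is a hypothetical helper.
check_augustus_config() {
    local cfg="${1:-${AUGUSTUS_CONFIG_PATH:-}}"
    if [ -z "$cfg" ]; then
        echo "AUGUSTUS_CONFIG_PATH is not set" >&2
        return 1
    fi
    if [ ! -d "$cfg/species" ]; then
        echo "no species/ directory under $cfg" >&2
        return 1
    fi
    if [ ! -w "$cfg/species" ]; then
        echo "$cfg/species is not writable" >&2
        return 1
    fi
    echo "OK: $cfg"
}
```

In a job script this could be called as check_augustus_config || exit 1 right after the export line, so the job stops early instead of failing partway through.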

GeneMark-ES

GCATemplates available: no

module spider GeneMark-ES

GeneMark-ES is for gene prediction in eukaryotic genomes.

To use GeneMark-ES you need to download the GeneMark-ES licence key file.

http://topaz.gatech.edu/GeneMark/license_download.cgi

Select the following: GeneMark-ES and LINUX 64.

You do not need to download the program, just the 64_bit key file.

Save the gm_key_64.gz to your $HOME directory.

Then gunzip the key file and rename it from gm_key_64 to .gm_key in your $HOME directory.
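The gunzip and rename steps can be combined into one command. In the sketch below, install_gm_key is a hypothetical helper; its defaults follow the file names given above. It uses gunzip -c, which leaves the downloaded .gz file in place rather than removing it.

```shell
# Decompress a downloaded GeneMark key and install it as a hidden file.
# install_gm_key is a hypothetical helper; by default it reads
# $HOME/gm_key_64.gz and writes $HOME/.gm_key.
install_gm_key() {
    local src="${1:-$HOME/gm_key_64.gz}"
    local dest="${2:-$HOME/.gm_key}"
    if [ ! -f "$src" ]; then
        echo "key file not found: $src" >&2
        return 1
    fi
    # -c writes the decompressed key to stdout, so the .gz is kept
    gunzip -c "$src" > "$dest"
}
```

Running install_gm_key with no arguments on a login node performs the default install; ls -a $HOME should then list .gm_key.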

GeneMarkS

GCATemplates available: no

module spider GeneMarkS

GeneMarkS is for gene prediction in prokaryotes, intron-less eukaryotes, eukaryotic viruses, phages and EST/cDNA sequences.

To use GeneMarkS you need to download the GeneMarkS licence key file.

http://topaz.gatech.edu/GeneMark/license_download.cgi

Select the following: GeneMark-ES/ET and LINUX 64.

You do not need to download the program, just the 64_bit key file.

Save the gm_key_64.gz to your $HOME directory.

Then gunzip the key file and rename it from gm_key_64 to .gm_key in your $HOME directory.

geneid

GCATemplates available: no

geneid homepage

module spider geneid

geneid is a program to predict genes in anonymous genomic sequences designed with a hierarchical structure.

RNAmmer

GCATemplates available: no

RNAmmer homepage

module spider RNAmmer

RNAmmer predicts ribosomal RNA genes in full genome sequences by utilising two levels of Hidden Markov Models. An initial spotter model, constructed from highly conserved loci within a structural alignment of known rRNA sequences, searches both strands. Once the spotter model detects the approximate position of a gene, flanking regions are extracted and passed to the full model, which matches the entire gene. This two-level approach avoids running the full model through an entire genome sequence, allowing faster predictions.

SNAP-HMM

GCATemplates available: no

module spider SNAP-HMM

SNAP-HMM (Semi-HMM-based Nucleic Acid Parser) is a general purpose gene finding program suitable for both eukaryotic and prokaryotic genomes.

Genome Annotation

MAKER

GCATemplates

grace (using $TMPDIR)

MAKER homepage

module spider MAKER

MAKER is a portable and easily configurable genome annotation pipeline.

Its purpose is to allow smaller eukaryotic and prokaryotic genome projects to independently annotate their genomes and to create genome databases.

Here is a good paper to help you get started.

You need to complete the following four steps prior to submitting a MAKER job script on an HPRC cluster:

1. Download the GeneMark license key

To use MAKER you need to download the GeneMark-ES licence key file, since GeneMark-ES is part of the MAKER pipeline.

Download here: http://topaz.gatech.edu/GeneMark/license_download.cgi

Select the following: GeneMark-ES/ET and LINUX 64.

You do not need to download the program, just the 64_bit key file.

Save the gm_key_64.gz to your $HOME directory.

Then gunzip the key file and rename it from gm_key_64 to .gm_key in your $HOME directory.

2. Create or copy the three required control files; you must edit maker_opts.ctl

Use the following commands to create the three MAKER control files in your current working directory:

module load GCC/9.3.0 OpenMPI/4.0.3 MAKER/3.01.03-Python-3.8.2
maker -CTL

2a. maker_opts.ctl: You need to edit the maker_opts.ctl file for your project. Set cpus=20 in maker_opts.ctl when using #SBATCH --cpus-per-task=28, or specify cpus as a command-line option:

maker -cpus 20

Recommended: set TMP= to $TMPDIR by using the following maker option:

maker -TMP $TMPDIR

2b. maker_bopts.ctl: You do not need to edit the maker_bopts.ctl file unless you want to adjust BLAST parameters.

2c. maker_exe.ctl: You do not need to edit the maker_exe.ctl file since it is pre-configured with executable paths.

3. Create a GeneMark HMM file

A GeneMark HMM file is needed if you want fasta sequences of predicted genes.

module load GeneMarkS/4.32
gmsn.pl -euk your_genome.fasta
gm -m GeneMark.mat -R -lo -op your_genome.fasta

GeneMark-ES and GeneMarkS are installed; either can generate the GeneMark HMM file.

Once your GeneMark_hmm.mod file is generated, add it to the gmhmm value in the maker_opts.ctl file.

4. Add an AUGUSTUS species in your maker_opts.ctl file at the line: augustus_species=

You can find a list of AUGUSTUS species by loading the MAKER module and then looking in the directory:

ls $EBROOTAUGUSTUS/config/species
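The maker_opts.ctl edits from steps 2 through 4 can be scripted instead of made by hand. The sketch below assumes each setting is a key=value line optionally followed by a #comment. set_ctl_value is a hypothetical helper, and arabidopsis is only an example species name.

```shell
# Set key=value in a MAKER control file, keeping any trailing "#comment".
# set_ctl_value is a hypothetical helper; values must not contain '|' or '&'
# because they are substituted into a sed expression.
set_ctl_value() {
    local key="$1" value="$2" file="$3"
    sed "s|^${key}=[^ #]*|${key}=${value}|" "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}

# Example (run in the directory where 'maker -CTL' created the files):
#   set_ctl_value cpus 20 maker_opts.ctl
#   set_ctl_value gmhmm "$PWD/GeneMark_hmm.mod" maker_opts.ctl
#   set_ctl_value augustus_species arabidopsis maker_opts.ctl
```

Writing to a temporary file and renaming it avoids depending on sed -i, whose syntax differs between platforms.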

MAKER 2.31.10

Maker version 2.31.10 requires that you run two scripts after the maker command is complete.

cd dpp_contig.maker.output

fasta_merge -d dpp_contig_master_datastore_index.log
gff3_merge -d dpp_contig_master_datastore_index.log

Maker version 2.31.10 -help information:

MAKER version 2.31.10

Usage:

     maker [options] <maker_opts> <maker_bopts> <maker_exe>


Description:

     MAKER is a program that produces gene annotations in GFF3 format using
     evidence such as EST alignments and protein homology. MAKER can be used to
     produce gene annotations for new genomes as well as update annotations
     from existing genome databases.

     The three input arguments are control files that specify how MAKER should
     behave. All options for MAKER should be set in the control files, but a
     few can also be set on the command line. Command line options provide a
     convenient machanism to override commonly altered control file values.
     MAKER will automatically search for the control files in the current
     working directory if they are not specified on the command line.

     Input files listed in the control options files must be in fasta format
     unless otherwise specified. Please see MAKER documentation to learn more
     about control file  configuration.  MAKER will automatically try and
     locate the user control files in the current working directory if these
     arguments are not supplied when initializing MAKER.

     It is important to note that MAKER does not try and recalculated data that
     it has already calculated.  For example, if you run an analysis twice on
     the same dataset you will notice that MAKER does not rerun any of the
     BLAST analyses, but instead uses the blast analyses stored from the
     previous run. To force MAKER to rerun all analyses, use the -f flag.

     MAKER also supports parallelization via MPI on computer clusters. Just
     launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
     configured during the MAKER installation process for this to work though


Options:
     -genome|g <file>    Overrides the genome file path in the control files

     -RM_off|R           Turns all repeat masking options off.

     -datastore/         Forcably turn on/off MAKER's two deep directory
      nodatastore        structure for output.  Always on by default.

     -old_struct         Use the old directory styles (MAKER 2.26 and lower)

     -base    <string>   Set the base name MAKER uses to save output files.
                         MAKER uses the input genome file name by default.

     -tries|t <integer>  Run contigs up to the specified number of tries.

     -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
                         Note: this is for BLAST and not for MPI!

     -force|f            Forces MAKER to delete old files before running again.
                         This will require all blast analyses to be rerun.

     -again|a            recaculate all annotations and output files even if no
                         settings have changed. Does not delete old analyses.

     -quiet|q            Regular quiet. Only a handlful of status messages.

     -qq                 Even more quiet. There are no status messages.

     -dsindex            Quickly generate datastore index file. Note that this
                         will not check if run settings have changed on contigs

     -nolock             Turn off file locks. May be usful on some file systems,
                         but can cause race conditions if running in parallel.

     -TMP                Specify temporary directory to use.

     -CTL                Generate empty control files in the current directory.

     -OPTS               Generates just the maker_opts.ctl file.

     -BOPTS              Generates just the maker_bopts.ctl file.

     -EXE                Generates just the maker_exe.ctl file.

     -MWAS    <option>   Easy way to control mwas_server for web-based GUI

                              options:  STOP
                                        START
                                        RESTART

     -version            Prints the MAKER version.

     -help|?             Prints this usage statement.

Funannotate

GCATemplates available: no

Funannotate homepage

module spider funannotate

Funannotate is a genome prediction, annotation, and comparison software package.

Prior to running funannotate in a job script, you will need to do the following:

Download the GeneMark license key as described in the GeneMark-ES section.
Load the funannotate module and rsync the AUGUSTUS config to your $SCRATCH directory.

Grace example to run on the login node command line prior to submitting your job script:

module load GCC/9.3.0  OpenMPI/4.0.3 funannotate/1.8.15-Python-3.8.2
mkdir $SCRATCH/my_augustus_config_3.4.0
rsync -r /sw/eb/sw/AUGUSTUS/3.4.0-gompi-2020a-Python-3.8.2/config/ $SCRATCH/my_augustus_config_3.4.0

Set the AUGUSTUS_CONFIG_PATH variable in your job script before the line that runs the funannotate command:

export AUGUSTUS_CONFIG_PATH=$SCRATCH/my_augustus_config_3.4.0

BRAKER1

GCATemplates available: no

BRAKER1 homepage

module spider BRAKER1

BRAKER1: Unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS

Since AUGUSTUS is used, you will need to rsync the Augustus config to one of your directories; $SCRATCH/my_augustus_config_3.4.0 is a good place.

module load GCC/10.2.0 OpenMPI/4.0.5 AUGUSTUS/3.4.0
mkdir $SCRATCH/my_augustus_config_3.4.0
rsync -r /sw/eb/sw/AUGUSTUS/3.4.0-foss-2020b/ $SCRATCH/my_augustus_config_3.4.0

You will also need to add the following in your job script:

export AUGUSTUS_CONFIG_PATH="$SCRATCH/my_augustus_config_3.4.0/config"

RepeatMasker

GCATemplates

RepeatMasker homepage

module spider RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

Currently RMBlast is the only configured sequence search engine, and it is also the default.

The RepBase database now charges a fee for a subscription to use their repeat databases. Neither HPRC nor TAMU has a RepBase subscription. RepeatMasker is available on HPRC clusters with the default repeat databases provided by RepeatMasker and not the RepBase databases.

TRF

GCATemplates available: no

TRF homepage

module load trf/409-linux-x86_64

trf (Tandem Repeats Finder) is a program to locate and display tandem repeats in DNA sequences.

RepeatScout

GCATemplates

Grace

RepeatScout homepage

module load RepeatScout/1.0.5-GCCcore-6.3.0

RepeatScout is a tool to discover repetitive substrings in DNA. The purpose of the RepeatScout software is to identify repeat family sequences from genomes where hand-curated repeat databases (a la RepBase update) are not available. In fact, the output of this program can be used as input to RepeatMasker as a way of automatically masking newly-sequenced genomes.

PASA

GCATemplates available: no

PASA homepage

module load PASA/2.3.3-foss-2018b-Perl-5.28.0

PASA, acronym for Program to Assemble Spliced Alignments, is a eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to automatically model gene structures, and to maintain gene structure annotation consistent with the most recently available experimental sequence data. PASA also identifies and classifies all splicing variations supported by the transcript alignments.

You can copy the PASA config files to your working directory and change the permissions. They are found here:

$PASAHOME/pasa_conf/
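The copy and permission change can be sketched as one step. copy_pasa_conf is a hypothetical helper, and $PASAHOME is assumed to be set once the PASA module is loaded.

```shell
# Copy the PASA config templates into a working directory and make the
# copies writable. copy_pasa_conf is a hypothetical helper; $PASAHOME is
# assumed to be set by the PASA module.
copy_pasa_conf() {
    local src="${1:-${PASAHOME:-}/pasa_conf}"
    local dest="${2:-$PWD/pasa_conf}"
    if [ ! -d "$src" ]; then
        echo "source directory not found: $src" >&2
        return 1
    fi
    # "$src"/. copies the directory contents, including hidden files
    mkdir -p "$dest" &&
        cp -r "$src"/. "$dest"/ &&
        chmod -R u+w "$dest"
}
```

With the module loaded, running copy_pasa_conf with no arguments creates a writable pasa_conf directory in the current working directory.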

Genome Completeness

BUSCO

GCATemplates

BUSCO homepage

version 5.0.x

Version 5.0.0+ uses MetaEuk as the default gene predictor instead of Augustus, so you don't need to rsync the AUGUSTUS directory unless you want to use Augustus.

Databases for v5.0.0+ are found on Grace in the directory:

/scratch/data/bio/busco5/lineages

Contact the HPRC helpdesk if you need additional databases downloaded to the shared busco5 lineages directory.

version 4.0.x

To use BUSCO you need to copy the augustus config files (about 1000 files) to your $SCRATCH directory (you only need to do this once). Make sure you load the BUSCO module successfully by running 'module list' after the module load command.

module purge
module load BUSCO/4.0.5-foss-2019b-Python-3.7.4
module list
mkdir $SCRATCH/my_augustus_config_3.3.3
rsync -r /sw/eb/software/AUGUSTUS/3.3.3-foss-2019b/  $SCRATCH/my_augustus_config_3.3.3
chmod -R 755 $SCRATCH/my_augustus_config_3.3.3

You need to add the following in your job script:

export AUGUSTUS_CONFIG_PATH="$SCRATCH/my_augustus_config_3.3.3/config"

The lineages for version 4.0.x use odb10. Contact HPRC helpdesk if there is not a lineage for your organism in the odb10 directory.

Grace: /scratch/data/bio/busco4/lineages/

version 3.0.2b

To use BUSCO you need to copy the augustus config files (about 1000 files) to your $SCRATCH directory (you only need to do this once). Make sure you load the BUSCO module successfully by running 'module list' after the module load command.

module purge
module load GCC/10.2.0 OpenMPI/4.0.5 BUSCO/5.1.2
module list
mkdir $SCRATCH/my_augustus_config_3.4.0
rsync -r /sw/eb/sw/AUGUSTUS/3.4.0-foss-2020b/ $SCRATCH/my_augustus_config_3.4.0
chmod -R 755 $SCRATCH/my_augustus_config_3.4.0

You need to add the following in your job script:

export AUGUSTUS_CONFIG_PATH="$SCRATCH/my_augustus_config_3.4.0/config"

A list of available augustus species can be found here:

/sw/eb/sw/AUGUSTUS/3.4.0-gompi-2020a-Python-3.8.2/config/species/

A list of available busco lineages for BUSCO version 5 can be found here:

/scratch/data/bio/busco5/lineages/

CEGMA

GCATemplates available: no

module spider CEGMA

CEGMA (Core Eukaryotic Genes Mapping Approach) is used for building a highly reliable set of gene annotations in the absence of experimental data.

QUAST

GCATemplates available: no

QUAST homepage

module load QUAST/5.0.2-intel-2018b-Python-3.6.6

QUAST evaluates genome assemblies.

REAPR

GCATemplates available: no

REAPR homepage

module spider REAPR

REAPR is a tool that evaluates the accuracy of a genome assembly using mapped paired end reads, without the use of a reference genome for comparison.

Bandage

GCATemplates available: no

Bandage homepage

Bandage is a GUI program that allows users to interact with the assembly graphs made by de novo assemblers such as Velvet, SPAdes, MEGAHIT and others.

module load Bandage/0.8.1_Centos

Bandage is a GUI application, so it can run faster on the HPRC portal at portal.hprc.tamu.edu.

Start a VNC job on the portal and type the following on the command prompt once your portal job is launched:

module load Bandage/0.8.1_Centos
Bandage