PacBio tools

Tutorials

2019 HPRC Research Computing Symposium - Long Read Tools slides

List of long-read software long-read-tools.org

Formats

subreads file

Below is an example ID string from a subreads.bam file with fields defined:

m54001_160302_121501.subreads.bam
1└─2─┘ └─────3─────┘└───4───┘ └5┘

[1] "m" = movie

[2] Instrument Serial Number

[3] Time of Run Start (`yymmdd_hhmmss`)

[4] File Descriptor

[5] File Extension

pls2fasta

GCATemplates available: no

pls2fasta Converts plx.h5/bax.h5/fofn files to fasta or fastq files. Although fasta files are provided with every run, they are not trimmed nor split into subreads. This program takes additional annotation information, such as the subread coordinates and high quality regions and uses them to create fasta sequences that are substrings of all bases called. Most of the time you will want to trim low quality reads, so you should specify -trimByRegion.

pls2fasta is found in the blasr module

module load blasr/5.3.0

Tool Suites

pbbioconda

GCATemplates available: no

Command line tool versions of the SMRT Analysis build.

https://github.com/PacificBiosciences/pbbioconda

Ask if you would like pbbioconda installed or install your own conda environment

See a list of available packages here.

SMRT Link

Command line tool versions of the SMRT Link application.

module spider SMRT-Link

Available on Grace.

Version 8

Version 8 is for Sequel and Sequel II

Version 8 user guide

Version 7

Version 7 is for RS II

Version 7 user guide

See an example of running the BaseMod analysis on the command line here.

Error correction

Racon

GCATemplates available: no

Consensus module for raw de novo DNA assembly of long uncorrected reads.

Racon is intended as a standalone consensus module to correct raw contigs generated by rapid assembly methods which do not include a consensus step.

module load Racon/1.3.1-GCCcore-6.3.0

LSC

GCATemplates available: no

LSC homepage

module spider LSC

LSC is a long read error correction tool. It offers fast correction with high sensitivity and good accuracy.

Proovread

GCATemplates

Ada

Proovread homepage

module load Proovread/2.12-intel-2015B

Large-scale high accuracy PacBio correction through iterative short read consensus.

To get an idea of the computational time needed, a single PacBio file 46GB in size was chunked into 40M reads per file totaling 1,168 files:

SeqChunker -s 40M -o pb-%03d.fq pacbio_reads.fq

Each file of 40M reads took approximately 1 hour to correct utilizing 20 cores and ~2GB memory (using only 10 cores took 2 hours). Adding the -u unitigs.fasta option still took about 1 hour to complete on 20 cores.

The Illumina paired end files used for the correction were 12GB each.

All input files should not be gzipped.

Genome assembly

Unicycler

GCATemplates available: no

Unicycler is an assembly pipeline for bacterial genomes.

It circularises replicons without the need for a separate tool like Circlator.

It can assemble Illumina-only read sets where it functions as a SPAdes-optimiser. It can also assembly long-read-only sets (PacBio or Nanopore) where it runs a miniasm+Racon pipeline. For the best possible assemblies, give it both Illumina reads and long reads, and it will conduct a hybrid assembly.

module load Unicycler/0.4.6-intel-2017A-Python-3.5.2

wtdbg2

GCATemplates

Grace

wtdbg2 homepage

WTDBG: De novo assembler for long noisy sequences

Wtdbg2 is a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore Technologies (ONT). It assembles raw reads without error correction and then builds the consensus from intermediate assembly output.

module spider wtdbg2

Flye

GCATemplates

Grace

Flye homepage

module spider Flye

Flye is a de novo assembler for single molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies.

It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies.

The package represents a complete pipeline: it takes raw PacBio / ONT reads as input and outputs polished contigs.

Flye also has a special mode for metagenome assembly.

Assembly polishing

Arrow

GCATemplates available: no

Arrow homepage

The GenomicConsensus package provides the variantCaller tool, which allows you to apply the Quiver or Arrow algorithm to mapped PacBio reads to get consensus and variant calls.

Arrow is useful if you only have PacBio sequence for genome polishing.

Arrow is installed in the pbbioconda Anaconda shared environment and can be utilized by loading the following modules.

module load Anaconda/2-5.0.1
module load BamTools/2.5.1-GCCcore-7.3.0
source activate pbbioconda

Running arrow on a single node may complete successfully for a bacterial genome with 50x coverage.

Polishing large genomes with arrow can take weeks if run on a single compute node.

It is recommended to use ArrowGrid to do genome polishing on large genomes starting with your unaligned PacBio subreads.bam files that you receive from the sequencing center.

ArrowGrid_HPRC

GCATemplates available: no

ArrowGrid_HPRC is a modified version of the ArrowGrid package.

ArrowGrid_HPRC will be added to Grace (SLURM).

It requires 65,000 SUs to submit a job that has 29 subreads.bam files

To estimate the number of required SUs to start an ArrowGrid_HPRC job, multiply the number of subreads.bam files * 2000 and add 7000

(number_of_subreads.bam_files * 2000) + 7000

The default is to run each array job for 3 days. You can change this using the -w option for arrow.sh (see below).

In this case you could estimate the number of required SUs as:

(number_of_subreads.bam_files * 28 * hours) + 7000

Example ArrowGrid_HPRC Job Script:

#!/bin/bash
#SBATCH --export=NONE               # do not export current env to the job
#SBATCH --job-name=arrow_polish     # set job name
#SBATCH --time=1-00:00:00           # Set the wall clock limit to 1 day
#SBATCH --nodes=1                   # uses 1 node
#SBATCH --ntasks-per-node=1         # run 1 task (command) per node
#SBATCH --cpus-per-task=28          # each task (command) uses 28 cores
#SBATCH --mem=56G                   # total memory per node
#SBATCH --output=stdout.%x.%j          # create a file for stdout
#SBATCH --error=stderr.%x.%j

module load ArrowGrid_HPRC/0.6.0-iomkl-2017b
source activate pbbioconda

arrow.sh -g genome.fasta -p prefix

You need a file named input.fofn in your working directory that contains one .subreads.bam file per line.

Helpful tips for polishing a large genome (polishing example below is for a 29 PacBio .bam files at 50x coverage):

Start with the 29 unaligned PacBio bam files.
- create a file named input.fofn which contains a list of the 29 PacBio unaligned.bam files with paths (one per line)
Your initial job script is only for running arrow.sh which configures the ArrowGrid array jobs so you do not need to submit a job array.
There are two parallel stages, align and arrow, the array jobs will show up in job NAME of the command squeue -l -u $USER prefixed as align_ and cns_ respectively.
- if the alignment stage completes and the arrow stage fails, then reconfigure the SLURM parameters as needed and submit the job script again and the second stage will start from where it left off.
When all 29 arrow array jobs are complete, there is a final merge step which results in a .fastq, .fasta and .gff file of your polished genome
A test run of arrow on 29 unaligned PadBio.bam files for 50x coverage of a genome sized 3Gb completed in 36 hours using ArrowGrid across 29 compute nodes.

------------------------------------------------------------------------
                      Run Arrow in grid mode
------------------------------------------------------------------------

Usage: arrow.sh -g <genome_assembly.fasta> -p <prefix>
    Required:
           -g genome.fasta      : GENOME: genome assembly fasta file (with .fai index file in same directory)
    Optional:
           -p prefix            : PREFIX: prefix for output files (default: arrow_polish)
           -r <text>            : ALGORITHM: arrow, quiver (default: arrow)
           -w <text>            : walltime for array jobs (default: SLURM: 3-00:00:00)
           -u <int>             : USEGRID: 0, 1 (default: 1)
      alignment step:           : alignment step values apply to variantCaller step if varantCaller step values are not used
           -a <text>            : additional alignment job submission parameters: Ex: -a '--mail-type=ALL [email protected]'
           -c <int>             : CPUs for each alignment (blasr) array job (default: 28)
           -m <int>             : memory per node (default: 56G)
           -k <int>             : SLURM tasks per node (default: 1)
           -q <text>            : queue (paritition) for array jobs (default: SLURM: long)
      variantCaller step:       : values for variantCaller step will equal alignment step if the following are not used
           -A <text>            : additional variantCaller job submission parameters: Ex: -A '--mail-type=ALL [email protected]'
           -C <int>             : CPUs for each variantCaller array job, each CPU needs about 5GB memory (default: 14)
           -M <int>             : memory per node (default: 56G)
           -K <int>             : SLURM tasks per node (default: 1)
           -Q <text>            : queue (paritition) for array jobs (default: SLURM: long)

           -h                   : print this help message.

If you use ArrowGrid, be sure to cite ArrowGrid: https://github.com/skoren/ArrowGrid

Purge_Haplotigs

GCATemplates available: no

Pipeline to help with curating heterozygous diploid genome assemblies.

Some parts of a genome may have a very high degree of heterozygosity. This causes contigs for both haplotypes of that part of the genome to be assembled as separate primary contigs, rather than as a contig and an associated haplotig.

module load Purge_Haplotigs/1.0.3-iomkl-2017b-R-3.5.0-recommended-mt

Redundans

GCATemplates

Ada

Redundans homepage

Redundans pipeline assists an assembly of heterozygous genomes. Program takes as input assembled contigs, sequencing libraries and/or reference sequence and returns scaffolded homozygous genome assembly. Final assembly should be less fragmented and with total size smaller than the input contigs. In addition, Redundans will automatically close the gaps resulting from genome assembly or scaffolding.

module load Redundans/0.13c-intel-2017b-Python-2.7.14

Circlator

GCATemplates

Ada

Circlator homepage

A tool to circularize genome assemblies.

Circlator will attempt to identify each circular sequence and output a linearised version of it. It does this by assembling all reads that map to contig ends and comparing the resulting contigs with the input assembly.

module load Circlator/1.5.5-intel-2017A-Python-3.5.2

wtdbg2

GCATemplates

Grace

wtdbg2 homepage

WTDBG: De novo assembler for long noisy sequences

Wtdbg2 is a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore Technologies (ONT). It assembles raw reads without error correction and then builds the consensus from intermediate assembly output.

module spider wtdbg2

Sequence alignments

Minimap2

GCATemplates

Grace

Minimap2 homepage

Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database.

module spider minimap2

pbalign

GCATemplates

Ada

pbalign homepage

pbalign maps PacBio reads to reference sequences.

module load SMRT-Link/10.1.0.119588-cli-tools-only

Also be sure to specify the number of OMP threads. If you are using 20 cores "#SBATCH --cpus-per-task=28" then use:

export OMP_NUM_THREADS=20

blasr

GCATemplates available: no

blasr homepage

The PacBio long read aligner

module load blasr/5.3.0

Also be sure to specify the number of OMP threads. If you are using 20 cores "#BSUB -n 20" then use:

export OMP_NUM_THREADS=20

PBSuite

GCATemplates available: no

PBSuite homepage

module spider PBSuite

PBJelly is a highly automated pipeline that aligns long sequencing reads (such as PacBio RS reads or long 454 reads in fasta format) to high-confidence draft assemblies.

PBJelly fills or reduces as many captured gaps as possible to produce upgraded draft genomes.