PacBio tools
Tutorials
2019 HPRC Research Computing Symposium - Long Read Tools slides
List of long-read software long-read-tools.org
Formats
subreads file
Below is an example subreads.bam file name with its fields defined:
m54001_160302_121501.subreads.bam
1└─2─┘ └─────3─────┘└───4───┘ └5┘
[1] "m" = movie
[2] Instrument Serial Number
[3] Time of Run Start (`yymmdd_hhmmss`)
[4] File Descriptor
[5] File Extension
pls2fasta
GCATemplates available: no
pls2fasta converts plx.h5/bax.h5/fofn files to fasta or fastq files. Although fasta files are provided with every run, they are neither trimmed nor split into subreads. This program takes additional annotation information, such as the subread coordinates and high-quality regions, and uses it to create fasta sequences that are substrings of all bases called. Most of the time you will want to trim low-quality reads, so you should specify -trimByRegion.
pls2fasta is found in the blasr module
module load blasr/5.3.0
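For example, to convert a bax.h5 file to trimmed subread sequences in fasta format (filenames below are placeholders):
pls2fasta input.bax.h5 output.fasta -trimByRegion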
Tool Suites
pbbioconda
GCATemplates available: no
Command line tool versions of the SMRT Analysis build.
https://github.com/PacificBiosciences/pbbioconda
Ask if you would like pbbioconda installed, or install the packages in your own conda environment.
See a list of available packages here.
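If you maintain your own environment, individual tools can be installed from the bioconda channel; for example (the package names shown are only examples, see the package list for the full set):
conda create --name pbtools -c bioconda -c conda-forge pbmm2 pbccs
source activate pbtools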
SMRT Link
Command line tool versions of the SMRT Link application.
module spider SMRT-Link
Available on Grace.
Version 8
Version 8 is for Sequel and Sequel II
Version 8 user guide
Version 7
Version 7 is for RS II
Version 7 user guide
See an example of running the BaseMod analysis on the command line here.
Error correction
Racon
GCATemplates available: no
Consensus module for raw de novo DNA assembly of long uncorrected reads.
Racon is intended as a standalone consensus module to correct raw contigs generated by rapid assembly methods which do not include a consensus step.
module load Racon/1.3.1-GCCcore-6.3.0
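Racon takes three positional arguments: the raw reads, read-to-contig overlaps (PAF or SAM, which Minimap2 can generate), and the contigs to polish. A minimal sketch with placeholder filenames:
minimap2 -t 28 -x map-pb raw_contigs.fasta reads.fastq > overlaps.paf
racon -t 28 reads.fastq overlaps.paf raw_contigs.fasta > polished_contigs.fasta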
LSC
GCATemplates available: no
LSC homepage
module spider LSC
LSC is a long read error correction tool. It offers fast correction with high sensitivity and good accuracy.
Proovread
Proovread homepage
Large-scale high accuracy PacBio correction through iterative short read consensus.
To get an idea of the computational time needed, a single 46 GB PacBio fastq file was split into 1,168 chunks of 40M each:
SeqChunker -s 40M -o pb-%03d.fq pacbio_reads.fq
Each 40M chunk took approximately 1 hour to correct using 20 cores and ~2 GB of memory (using only 10 cores took 2 hours). Adding the -u unitigs.fasta option still took about 1 hour to complete on 20 cores.
The Illumina paired-end files used for the correction were 12 GB each.
Input files should not be gzipped.
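Each chunk is then corrected independently against the Illumina reads; a single chunk run might look like this (filenames and prefix are placeholders):
proovread -t 20 -l pb-001.fq -s illumina_R1.fq -s illumina_R2.fq --pre pb-001-corrected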
Genome assembly
Unicycler
GCATemplates available: no
Unicycler is an assembly pipeline for bacterial genomes.
It circularises replicons without the need for a separate tool like Circlator.
It can assemble Illumina-only read sets, where it functions as a SPAdes optimiser. It can also assemble long-read-only sets (PacBio or Nanopore), where it runs a miniasm+Racon pipeline. For the best possible assemblies, give it both Illumina reads and long reads, and it will conduct a hybrid assembly.
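For example, a hybrid assembly using Illumina paired-end reads plus PacBio long reads (filenames are placeholders):
unicycler -1 illumina_R1.fastq.gz -2 illumina_R2.fastq.gz -l pacbio_reads.fastq.gz -o unicycler_out -t 28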
wtdbg2
wtdbg2 homepage
WTDBG: De novo assembler for long noisy sequences
Wtdbg2 is a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore Technologies (ONT). It assembles raw reads without error correction and then builds the consensus from intermediate assembly output.
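Assembly and consensus are separate commands; a sketch for PacBio Sequel reads (the -g genome size is a placeholder you must set for your organism):
wtdbg2 -x sq -g 3g -t 28 -i pacbio_reads.fastq.gz -fo prefix
wtpoa-cns -t 28 -i prefix.ctg.lay.gz -fo prefix.ctg.fasta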
Flye
Flye homepage
Flye is a de novo assembler for single molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies.
It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies.
The package represents a complete pipeline: it takes raw PacBio / ONT reads as input and outputs polished contigs.
Flye also has a special mode for metagenome assembly.
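A minimal run on raw PacBio reads (filenames and output directory are placeholders; some Flye versions also require --genome-size):
flye --pacbio-raw pacbio_reads.fastq.gz --out-dir flye_out --threads 28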
Assembly polishing
Arrow
GCATemplates available: no
Arrow homepage
The GenomicConsensus package provides the variantCaller tool, which allows you to apply the Quiver or Arrow algorithm to mapped PacBio reads to get consensus and variant calls.
Arrow is useful if you only have PacBio sequence for genome polishing.
Arrow is installed in the pbbioconda Anaconda shared environment and can be utilized by loading the following modules.
module load Anaconda/2-5.0.1
module load BamTools/2.5.1-GCCcore-7.3.0
source activate pbbioconda
Running arrow on a single node may complete successfully for a bacterial genome with 50x coverage.
Polishing large genomes with arrow can take weeks if run on a single compute node.
It is recommended to use ArrowGrid to do genome polishing on large genomes starting with your unaligned PacBio subreads.bam files that you receive from the sequencing center.
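For a small genome that does fit on a single node, the general workflow is to index the assembly, align the subreads to it, and run arrow on the aligned BAM. A sketch with placeholder filenames (samtools is assumed to be available for indexing the reference):
samtools faidx draft_assembly.fasta
pbalign --nproc 28 subreads.bam draft_assembly.fasta aligned_subreads.bam
pbindex aligned_subreads.bam
arrow -j 28 aligned_subreads.bam -r draft_assembly.fasta -o consensus.fasta -o consensus.fastq -o variants.gff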
ArrowGrid_HPRC
GCATemplates available: no
ArrowGrid_HPRC is a modified version of the ArrowGrid package.
ArrowGrid_HPRC will be added to Grace (SLURM).
For example, submitting a job with 29 subreads.bam files requires about 65,000 SUs.
To estimate the number of SUs required to start an ArrowGrid_HPRC job, multiply the number of subreads.bam files by 2,000 and add 7,000:
(number_of_subreads.bam_files * 2000) + 7000
The default is to run each array job for 3 days. You can change this using the -w option for arrow.sh (see below).
If you change the walltime, you can estimate the number of required SUs as (assuming 28 cores per node):
(number_of_subreads.bam_files * 28 * hours) + 7000
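For example, 29 subreads.bam files at the default 3-day (72-hour) walltime works out to (29 * 28 * 72) + 7000 ≈ 65,500 SUs, in line with the 65,000 SU estimate above.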
Example ArrowGrid_HPRC Job Script:
#!/bin/bash
#SBATCH --export=NONE # do not export current env to the job
#SBATCH --job-name=arrow_polish # set job name
#SBATCH --time=1-00:00:00 # Set the wall clock limit to 1 day
#SBATCH --nodes=1 # uses 1 node
#SBATCH --ntasks-per-node=1 # run 1 task (command) per node
#SBATCH --cpus-per-task=28 # each task (command) uses 28 cores
#SBATCH --mem=56G # total memory per node
#SBATCH --output=stdout.%x.%j # create a file for stdout
#SBATCH --error=stderr.%x.%j             # create a file for stderr
module load ArrowGrid_HPRC/0.6.0-iomkl-2017b
source activate pbbioconda
arrow.sh -g genome.fasta -p prefix
You need a file named input.fofn in your working directory that contains one .subreads.bam file per line.
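For example, if all of your unaligned subreads.bam files are in one directory, you could create it with:
ls /path/to/your/*.subreads.bam > input.fofn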
Helpful tips for polishing a large genome (the example below is for 29 PacBio .bam files at 50x coverage):
- Start with the 29 unaligned PacBio bam files.
- Create a file named input.fofn which lists the 29 unaligned PacBio .bam files with paths (one per line).
- Your initial job script is only for running arrow.sh which configures the ArrowGrid array jobs so you do not need to submit a job array.
- There are two parallel stages, align and arrow; the array jobs will show up in the job NAME column of squeue -l -u $USER, prefixed with align_ and cns_ respectively.
- If the alignment stage completes and the arrow stage fails, reconfigure the SLURM parameters as needed and resubmit the job script; the second stage will start from where it left off.
- When all 29 arrow array jobs are complete, a final merge step produces .fastq, .fasta and .gff files of your polished genome.
- A test run of arrow on 29 unaligned PacBio .bam files, at 50x coverage of a 3 Gb genome, completed in 36 hours using ArrowGrid across 29 compute nodes.
------------------------------------------------------------------------
Run Arrow in grid mode
------------------------------------------------------------------------
Usage: arrow.sh -g <genome_assembly.fasta> -p <prefix>
Required:
-g genome.fasta : GENOME: genome assembly fasta file (with .fai index file in same directory)
Optional:
-p prefix : PREFIX: prefix for output files (default: arrow_polish)
-r <text> : ALGORITHM: arrow, quiver (default: arrow)
-w <text> : walltime for array jobs (default: SLURM: 3-00:00:00)
-u <int> : USEGRID: 0, 1 (default: 1)
alignment step: : alignment step values apply to variantCaller step if variantCaller step values are not used
-a <text> : additional alignment job submission parameters: Ex: -a '--mail-type=ALL --mail-user=user@email.edu'
-c <int> : CPUs for each alignment (blasr) array job (default: 28)
-m <int> : memory per node (default: 56G)
-k <int> : SLURM tasks per node (default: 1)
-q <text> : queue (partition) for array jobs (default: SLURM: long)
variantCaller step: : values for variantCaller step will equal alignment step if the following are not used
-A <text> : additional variantCaller job submission parameters: Ex: -A '--mail-type=ALL --mail-user=user@email.edu'
-C <int> : CPUs for each variantCaller array job, each CPU needs about 5GB memory (default: 14)
-M <int> : memory per node (default: 56G)
-K <int> : SLURM tasks per node (default: 1)
-Q <text> : queue (partition) for array jobs (default: SLURM: long)
-h : print this help message.
If you use ArrowGrid, be sure to cite ArrowGrid: https://github.com/skoren/ArrowGrid
Purge_Haplotigs
GCATemplates available: no
Pipeline to help with curating heterozygous diploid genome assemblies.
Some parts of a genome may have a very high degree of heterozygosity. This causes contigs for both haplotypes of that part of the genome to be assembled as separate primary contigs, rather than as a contig and an associated haplotig.
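The pipeline runs in three steps on a sorted BAM of PacBio reads mapped to the draft assembly; a sketch with placeholder filenames (the -l/-m/-h cutoffs are read from the coverage histogram produced by the first step, and subcommand names may differ slightly between versions):
purge_haplotigs hist -b aligned_sorted.bam -g genome.fasta -t 20
purge_haplotigs cov -i aligned_sorted.bam.gencov -l 5 -m 30 -h 145
purge_haplotigs purge -g genome.fasta -c coverage_stats.csv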
Redundans
Redundans homepage
The Redundans pipeline assists in the assembly of heterozygous genomes. It takes as input assembled contigs, sequencing libraries and/or a reference sequence, and returns a scaffolded homozygous genome assembly. The final assembly should be less fragmented and smaller in total size than the input contigs. In addition, Redundans will automatically close gaps resulting from genome assembly or scaffolding.
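A typical run with paired-end Illumina libraries, PacBio long reads and draft contigs might look like this (filenames are placeholders):
redundans.py -i illumina_R1.fastq.gz illumina_R2.fastq.gz -l pacbio_reads.fastq.gz -f contigs.fasta -o redundans_out -t 28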
Circlator
Circlator homepage
A tool to circularize genome assemblies.
Circlator will attempt to identify each circular sequence and output a linearised version of it. It does this by assembling all reads that map to contig ends and comparing the resulting contigs with the input assembly.
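The usual entry point is the all pipeline, which takes the assembly, the corrected long reads and an output directory as arguments (filenames are placeholders):
circlator all assembly.fasta corrected_long_reads.fasta circlator_out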
Sequence alignments
Minimap2
Minimap2 homepage
Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database.
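For example, to map PacBio subreads to a reference and produce SAM output using the PacBio preset (filenames are placeholders):
minimap2 -ax map-pb -t 28 reference.fasta pacbio_reads.fastq.gz > aligned.sam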
pbalign
pbalign homepage
pbalign maps PacBio reads to reference sequences.
Also be sure to specify the number of OMP threads. If you are using 20 cores (#SBATCH --cpus-per-task=20), then use:
export OMP_NUM_THREADS=20
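pbalign takes the input reads, the reference and the output alignment file as positional arguments; a minimal sketch with placeholder filenames:
pbalign --nproc 20 subreads.bam reference.fasta aligned_subreads.bam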
blasr
GCATemplates available: no
blasr homepage
The PacBio long read aligner
Also be sure to specify the number of OMP threads. If you are using 20 cores (#SBATCH --cpus-per-task=20), then use:
export OMP_NUM_THREADS=20
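A minimal run with placeholder filenames (blasr 5.x uses double-dash options):
blasr subreads.bam reference.fasta --bam --out aligned_subreads.bam --nproc 20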
PBSuite
GCATemplates available: no
PBSuite homepage
PBJelly is a highly automated pipeline that aligns long sequencing reads (such as PacBio RS reads or long 454 reads in fasta format) to high-confidence draft assemblies.
PBJelly fills or reduces as many captured gaps as possible to produce upgraded draft genomes.
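PBJelly is driven by a Protocol.xml file describing the assembly, the read files and the output directory; once that file is prepared, each stage is run in order (a sketch, assuming a prepared Protocol.xml):
Jelly.py setup Protocol.xml
Jelly.py mapping Protocol.xml
Jelly.py support Protocol.xml
Jelly.py extraction Protocol.xml
Jelly.py assembly Protocol.xml
Jelly.py output Protocol.xml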