NGS: Common Tools
Parallel
GCATemplates available: no
GNU Parallel - build and execute shell command lines from standard input in parallel
GNU Parallel documentation
Example Grace job script where many input .fasta files are in a directory named inputs and an empty directory named outputs exists:
#!/bin/bash
#SBATCH --job-name=my_parallel_job # job name
#SBATCH --time=7-00:00:00 # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1 # tasks (commands) per compute node
#SBATCH --cpus-per-task=48 # CPUs (threads) per command
#SBATCH --mem=360G # total memory per node
#SBATCH --output=stdout.%x.%j # save stdout to file
#SBATCH --error=stderr.%x.%j # save stderr to file
module purge
module load GCC/10.2.0 OpenMPI/4.0.5 MAFFT/7.490-with-extensions
module load parallel/20210322
parallel --joblog mafft.log "mafft {} > outputs/{/.}.aln.fasta" ::: inputs/*.fasta
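By default, GNU Parallel runs one job per CPU core, so the script above runs up to 48 mafft commands at a time. If each command is itself multi-threaded, you can cap the number of simultaneous jobs with the -j option; a minimal sketch (the -j value and thread count are illustrative):
# run at most 12 mafft commands at a time, 4 threads each (12 x 4 = 48 CPUs)
parallel -j 12 --joblog mafft.log "mafft --thread 4 {} > outputs/{/.}.aln.fasta" ::: inputs/*.fasta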
When the job is complete, you can check the mafft.log file to verify that all commands completed successfully by looking at the Exitval column, where 0 means the command completed successfully and a non-zero value means it failed.
Seq  Host  Starttime       JobRuntime  Send  Receive  Exitval  Signal  Command
1    :     1638568687.578  0.506       0     0        0        0       mafft inputs/sample1.fasta > outputs/sample1.aln.fasta
If any jobs failed because they ran out of walltime or memory, you can restart the job by running the job script again with the --resume-failed option, which identifies the failed jobs and reruns them. After that, it resumes the last unfinished job and continues from there.
parallel --resume-failed --joblog mafft.log "mafft {} > outputs/{/.}.aln.fasta" ::: inputs/*.fasta
replacement string | description | example output |
---|---|---|
{} | input line | inputs/sample1.fasta |
{.} | input line without extension | inputs/sample1 |
{/} | basename of input line | sample1.fasta |
{/.} | basename of input line without extension | sample1 |
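You can preview how a replacement string expands by echoing it before running the real command; for example:
parallel echo {/.} ::: inputs/sample1.fasta
# prints: sample1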
BamTools
GCATemplates available: no
BamTools provides both a programmer's API and an end-user's toolkit for handling BAM files.
bgzip
GCATemplates available: no
bgzip compresses files in a manner similar to, and compatible with, gzip.
The bgzip command can be found in the HTSlib modules.
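A minimal usage sketch (the module version below is illustrative; run module spider HTSlib to see what is installed):
module load GCC/10.2.0 HTSlib/1.11   # example version
bgzip myfile.vcf                     # compresses to myfile.vcf.gz
bgzip -d myfile.vcf.gz               # decompresses back to myfile.vcf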
Picard tools
GCATemplates available: no
The Picard tools use the /tmp directory, which has limited space.
Instead, use the automatically generated $TMPDIR directory for all the Picard tools.
Here is an example of using $TMPDIR in a job that reserves 54 GB of memory:
java -Xmx48g -jar $EBROOTPICARD/SortSam.jar TMP_DIR=$TMPDIR I=myfile.sam O=myfile_sorted.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT
The Java process is limited to 48 GB of memory; once Picard has used all 48 GB, temporary files are written to $TMPDIR.
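For reference, a sketch of the matching Slurm directive, sized as in the example above (reserve more memory than the Java heap so the JVM itself has headroom):
#SBATCH --mem=54G   # java -Xmx48g leaves ~6 GB of headroom for JVM overhead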
tabix
GCATemplates available: no
Tabix indexes a TAB-delimited genome position file in.tab.bgz and creates an index file (in.tab.bgz.tbi or in.tab.bgz.csi) when region is absent from the command-line. The input data file must be position sorted and compressed by bgzip which has a gzip(1) like interface.
The tabix command is found in the HTSlib module.
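A minimal sketch of the bgzip + tabix workflow (the column numbers are illustrative for a generic position file with sequence name, start, and end columns):
sort -k1,1 -k2,2n in.tab > in.sorted.tab   # position sort: sequence name, then start
bgzip in.sorted.tab                        # produces in.sorted.tab.gz
tabix -s 1 -b 2 -e 3 in.sorted.tab.gz      # index: name col 1, begin col 2, end col 3
tabix in.sorted.tab.gz chr1:1000-2000      # retrieve records overlapping a region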
VCFtools
GCATemplates available: no
VCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project. The aim of VCFtools is to provide easily accessible methods for working with complex genetic variation data in the form of VCF files.
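For example, a common task is computing allele frequencies from a VCF (a sketch; the file names are illustrative, and you will need to load a VCFtools module first):
vcftools --vcf input.vcf --freq --out allele_freqs   # writes allele frequencies to allele_freqs.frq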
BEDTools
GCATemplates available: no
The BEDTools utilities allow one to address common genomics tasks such as finding feature overlaps and computing coverage. The utilities are largely based on four widely-used file formats: BED, GFF/GTF, VCF, and SAM/BAM.
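For example, finding feature overlaps is a one-liner with bedtools intersect (a sketch; the file names are illustrative):
bedtools intersect -a genes.bed -b peaks.bed > overlaps.bed   # features in genes.bed that overlap peaks.bed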
pybedtools
GCATemplates available: no
The BEDTools suite of programs is widely used for genomic interval manipulation or “genome algebra”. pybedtools wraps and extends BEDTools and offers feature-level manipulations from within Python.
seqtk
Homepage: https://github.com/lh3/seqtk
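seqtk is a toolkit for processing FASTA/FASTQ files. A typical use is converting fastq to fasta (a sketch; load a seqtk module first):
seqtk seq -a reads.fastq.gz > reads.fasta   # -a forces fasta output; gzipped input is read transparently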
SRA-toolkit
Used to download Sequence Read Archive (SRA) files and extract them into fastq file(s).
SRA-Toolkit will download files to your home directory by default. Since your home directory is limited to 10GB, you can redirect the downloads to your scratch space by creating a directory in scratch and making a symbolic link to that directory from your home directory.
For newer versions of SRA-Toolkit, this is instead done using the vdb-config command to point the cache at a directory in your $SCRATCH space. You only need to run vdb-config once to set up the cache directory. Do the following at the Grace login command prompt prior to submitting a job script.
mkdir /scratch/user/YOUR_NETID/ncbi
# on Grace
module purge
module load GCC/10.2.0 OpenMPI/4.0.5 SRA-Toolkit/2.10.9
vdb-config --interactive
# in the interactive menu, use letter keys, tab, or mouse clicks to select items:
#   type c for CACHE
#   type o for choose
#   select [ Create Dir ], hit enter, and type /scratch/user/YOUR_NETID/ncbi
#   select OK, hit enter, then type y to confirm changing the location
#   type s to save and x to exit
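Alternatively, a non-interactive sketch, assuming your SRA-Toolkit version supports setting the repository root with --set:
vdb-config --set /repository/user/main/public/root=/scratch/user/YOUR_NETID/ncbi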
# when you are done downloading and processing sra files, you will need to remove downloaded .sra files
# from the directory /scratch/user/your_netid/ncbi/sra/
The compute nodes are not connected to the internet, so you will need to load the WebProxy module after loading the SRA-Toolkit module in your job script.
With the web proxy enabled in a job script, you can prefetch a .sra file and then use fastq-dump to process it. Downloading the .sra file to $TMPDIR is a good approach since $TMPDIR is deleted after the job completes; you just need to specify an output directory for the processed fastq files.
#!/bin/bash
#SBATCH --export=NONE # do not export current env to the job
#SBATCH --job-name=fastq-dump # job name
#SBATCH --time=1-00:00:00 # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1 # tasks (commands) per compute node
#SBATCH --cpus-per-task=2 # CPUs (threads) per command
#SBATCH --mem=10G # total memory per node
#SBATCH --output=stdout.%x.%j # save stdout to file
#SBATCH --error=stderr.%x.%j # save stderr to file
# on Grace
module purge
module load GCC/10.2.0 OpenMPI/4.0.5 SRA-Toolkit/2.10.9
# enable proxy to allow compute node connection to internet
module load WebProxy
prefetch --output-directory $TMPDIR SRR575500 && \
fastq-dump --outdir seqs -F -I --gzip $TMPDIR/SRR575500/SRR575500.sra
- add the --split-files option to the fastq-dump command for paired-end reads, as shown below
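A paired-end sketch of the same fastq-dump command:
fastq-dump --outdir seqs --split-files -F -I --gzip $TMPDIR/SRR575500/SRR575500.sra
# writes seqs/SRR575500_1.fastq.gz and seqs/SRR575500_2.fastq.gz for a paired-end run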
SRA-Toolkit (fastq-dump) is also available in Maroon Galaxy.
Browse SRA using SRA Explorer, where the 'saved datasets' feature provides URLs for downloading fastq files directly with wget instead of having to use SRA-Toolkit.
Install Aspera
SRA-Toolkit will check whether you have Aspera installed. The Aspera ascp command downloads SRA files faster than wget.
Run the installation script from any directory. This will install configuration files in your ~/.aspera directory.
/scratch/data/bio/bin/ibm-aspera-connect_4.0.2.38_linux.sh
Then add the following to your PATH in your ~/.bash_profile file:
PATH=$PATH:$HOME/.aspera/connect/bin
Example command:
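With ascp on your PATH, the prefetch command shown earlier will pick up Aspera automatically when available, so the example is simply (reusing the accession from above):
prefetch --output-directory $TMPDIR SRR575500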