Aspera (SRA, 1000genomes, BioMart, Illumina BaseMount)
Aspera
Install Aspera
SRA-Toolkit will check whether you have Aspera installed.
The Aspera ascp command downloads SRA files faster than wget.
Run the following command from any directory. This will install configuration files in your ~/.aspera directory.
/scratch/data/bio/bin/ibm-aspera-connect_4.0.2.38_linux.sh
Then add the following to your PATH in your ~/.bash_profile file
PATH=$PATH:$HOME/.aspera/connect/bin
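After updating your PATH, you can confirm that the shell finds ascp; a minimal check (the -A flag, which prints version and license information, is an assumption to verify against your ascp build):
source ~/.bash_profile
which ascp
ascp -A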
Example command: ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/ERR315/009/ERR3155119/ERR3155119.fastq.gz ./
Downloading 1000 Genomes data
Log in to a Grace data transfer node from your desktop
ssh netid@grace-dtn1.hprc.tamu.edu
Sample command to download a fastq.gz file
ascp -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -QTr -l10000m \
anonftp@ftp-trace.ncbi.nih.gov:/1000genomes/ftp/phase3/data/NA21087/sequence_read/SRR442587_1.filt.fastq.gz ./
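If the run is paired-end, the second read file usually sits alongside the first; a sketch of the same command for the mate file (the _2 file name is an assumption based on the 1000 Genomes naming convention):
ascp -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -QTr -l10000m \
anonftp@ftp-trace.ncbi.nih.gov:/1000genomes/ftp/phase3/data/NA21087/sequence_read/SRR442587_2.filt.fastq.gz ./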
Uploading to SRA
Log in to a fast transfer node from a Grace login node. The Grace data transfer nodes are grace-dtn1.hprc.tamu.edu and grace-dtn2.hprc.tamu.edu, e.g.:
ssh netid@grace-dtn1.hprc.tamu.edu
Sample command to upload to SRA
time ascp -i <path/to/ncbi_key_file> -QT -l10000m -k1 -d <path/to/files/directory/> \
subasp@upload.ncbi.nlm.nih.gov:uploads/<ncbi_account_email>_<random_code>/<submission_folder>/
The < and > characters are not part of the command; they only mark the values you need to provide.
The key file is provided by NCBI and must be given as an absolute path, e.g.: /home/<username>/keys/aspera.openssh
The random code for the upload is provided by NCBI.
The submission folder is required and will be created automatically by the ascp command.
Note the use of the time command before the ascp command. If the command stops at 60 minutes, your upload did not complete.
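Because the command above includes -k1 (a resume level), re-running the identical ascp command after an interrupted upload should resume the transfer rather than start over; a sketch, assuming the same paths as above:
# re-run the identical command; -k1 lets ascp skip data that already transferred completely
time ascp -i <path/to/ncbi_key_file> -QT -l10000m -k1 -d <path/to/files/directory/> \
subasp@upload.ncbi.nlm.nih.gov:uploads/<ncbi_account_email>_<random_code>/<submission_folder>/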
SRA-toolkit
Used to download Sequence Read Archive (SRA) files and extract them into fastq file(s).
# on Grace
module load GCC/10.2.0 OpenMPI/4.0.5 SRA-Toolkit/2.10.9
SRA-Toolkit will download files to your home directory by default. Since your home directory is limited to 10GB, you can redirect the downloads to your scratch space by creating a directory in scratch and making a symbolic link to that directory from your home directory.
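For older SRA-Toolkit versions, the symbolic-link approach looks like the following sketch; the $HOME/ncbi directory name is an assumption about where older toolkit versions write by default:
# create the download directory in scratch and link it from your home directory
mkdir $SCRATCH/ncbi
ln -s $SCRATCH/ncbi $HOME/ncbi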
For newer versions of SRA-Toolkit, this is done using the vdb-config command to configure the cache directory to a directory in your $SCRATCH space. You only need to run vdb-config once to set up the cache directory. Do the following at a Grace login node command prompt prior to submitting a job script.
mkdir /scratch/user/YOUR_NETID/ncbi
# on Grace
module load GCC/10.2.0 OpenMPI/4.0.5 SRA-Toolkit/2.10.9
vdb-config --interactive
# use letter and tab keys or mouse clicks to select menu items
- type c for CACHE
- type o for choose
- select [ Create Dir ], hit enter, and type /scratch/user/YOUR_NETID/ncbi
- select OK, hit enter, then hit y to answer yes when asked to change the location
- type s to save and x to exit
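As an alternative to the interactive menu, the same setting can usually be made non-interactively; a sketch, assuming the cache location lives under the /repository/user/main/public/root configuration node (verify the result with vdb-config --interactive afterwards):
vdb-config -s /repository/user/main/public/root=/scratch/user/YOUR_NETID/ncbi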
# when you are done downloading and processing sra files, you will need to remove downloaded .sra files
# from the directory /scratch/user/your_netid/ncbi/sra/
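When you are ready to clean up, the removal might look like this (a sketch; list the directory first to confirm you are deleting only cached .sra files):
ls /scratch/user/YOUR_NETID/ncbi/sra/
rm /scratch/user/YOUR_NETID/ncbi/sra/*.sra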
The compute nodes are not connected to the internet, so you will need to add the web proxy lines after loading the SRA-Toolkit module in your job script.
With the web proxy lines added to a job script, you can prefetch a .sra file and then use fastq-dump to process it. Downloading the .sra file to $TMPDIR is a good approach since $TMPDIR is deleted after the job completes. You just need to specify an output directory for the processed fastq files.
#!/bin/bash
#SBATCH --export=NONE # do not export current env to the job
#SBATCH --job-name=fastq-dump # job name
#SBATCH --time=1-00:00:00 # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1 # tasks (commands) per compute node
#SBATCH --cpus-per-task=2 # CPUs (threads) per command
#SBATCH --mem=10G # total memory per node
#SBATCH --output=stdout.%x.%j # save stdout to file
#SBATCH --error=stderr.%x.%j # save stderr to file
# on Grace
module load GCC/10.2.0 OpenMPI/4.0.5 SRA-Toolkit/2.10.9
# enable proxy to allow compute node connection to internet
module load WebProxy
prefetch --output-directory $TMPDIR SRR575500 && \
fastq-dump --outdir seqs -F -I --gzip $TMPDIR/SRR575500/SRR575500.sra
- add the --split-files option to the fastq-dump command for paired-end reads, as shown in the sketch below
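For example, a paired-end version of the job script's download-and-dump step might look like this (a sketch; whether SRR575500 actually contains paired reads is not verified here):
prefetch --output-directory $TMPDIR SRR575500 && \
fastq-dump --outdir seqs --split-files -F -I --gzip $TMPDIR/SRR575500/SRR575500.sra
# with --split-files, paired reads are written as seqs/SRR575500_1.fastq.gz and seqs/SRR575500_2.fastq.gz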
SRA-toolkit (fastq-dump) is also available in Maroon Galaxy.
Browse SRA using SRA Explorer, where the 'saved datasets' feature provides URLs so you can download fastq files directly with wget instead of having to use SRA-Toolkit.
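A wget download using an SRA Explorer URL might look like the following; the URL here is an assumption that reuses the EBI fastq path from the ascp example above, while real URLs come from your SRA Explorer 'saved datasets' list:
wget https://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/009/ERR3155119/ERR3155119.fastq.gz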
gsutil
gsutil is a Python application that lets you access Cloud Storage from the command line.
gsutil homepage
1. Go to https://cloud.google.com to register for Google Cloud.
- After you register, you will receive $300 in Free Trial credits (the Free Trial period is 3 months).
- You're also eligible for an additional $100.00 in Free Trial credits for a total of $400.00. You'll receive these credits within 24 hours of completing signup.
2. Create a Google Cloud project, to which you will link a payment method in the next step.
- For this tutorial, the Google Cloud project is named my_sra_downloads with the auto-generated id my_sra_downloads-355918.
3. Go to the Billing tab in the left panel of your cloud.google.com dashboard and enable billing by linking your account to a credit card.
- Your card does not get charged during the Free Trial period, and auto-billing after the Free Trial is disabled.
4. Log in to an HPRC cluster and load the gsutil module.
Grace:
module load GCCcore/11.2.0 gsutil/5.10
5. Configure gsutil to allow read/write access.
gsutil config -f
- Copy the URL in the output of the gsutil command and paste it into your web browser.
- Copy the code from the web browser output, paste it in the terminal, and hit enter.
6. Copy a file from the Google cloud into the current directory.
gsutil -u <project_id> cp <bucket URL> <output_file_name>
- Example command for downloading one file, using project_id my_sra_downloads-355918, for the sra file gs://sra-pub-src-18/SRR13606306/E_S1_L001_I1_001.fastq.gz.1 and naming it E_S1_L001_I1_001.fastq.gz:
gsutil -u my_sra_downloads-355918 cp gs://sra-pub-src-18/SRR13606306/E_S1_L001_I1_001.fastq.gz.1 E_S1_L001_I1_001.fastq.gz
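To find file names before copying, you can list the bucket contents; a sketch using the same requester-pays project and bucket as the example above:
gsutil -u my_sra_downloads-355918 ls gs://sra-pub-src-18/SRR13606306/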
BioMart
biomaRt
The biomaRt package provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way, without the need to know the underlying database schemas or write complex SQL queries.
biomaRt user guide
RStudio on the HPRC portal cannot be used for biomaRt since RStudio runs on a compute node and the compute nodes do not have internet access.
Once you have saved your XML query from a web-based BioMart interface to a text file, use the following example to retrieve the query results from biomart.org.
Read your BioMart XML query file into R, retrieve sequences, and print the output to a file:
Example of XML query saved to a text file (query.xml for example):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName = "default" formatter = "FASTA" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >
<Dataset name = "mmusculus_gene_ensembl" interface = "default" >
<Filter name = "start" value = "1"/>
<Filter name = "chromosome_name" value = "18"/>
<Filter name = "end" value = "5000000"/>
<Attribute name = "peptide" />
<Attribute name = "ensembl_gene_id" />
<Attribute name = "ensembl_transcript_id" />
</Dataset>
</Query>
The biomaRt R package can be accessed using the R_tamu module:
module load R_tamu/3.5.0-iomkl-2017b-recommended-mt
Start the R command line
R
Then on the R command line load the biomaRt library
library(biomaRt)
R commands to retrieve fasta sequences using the query.xml file and save results to a file (martout.fasta in this example):
# read the XML query file into a character vector, one element per line
myxml<-readLines('query.xml')
# collapse the lines into a single string
mytxt<-paste(unlist(myxml), collapse='')
# submit the XML query to biomart.org and retrieve the sequences
myseqs<-getXML(xmlquery=mytxt)
# write the fasta-formatted results to a file
write.table(myseqs, 'martout.fasta', row.names=F, col.names=F, quote=F)
There is a warning message which can be ignored:
Warning message:
Function 'getXML()' is deprecated.
Use 'biomaRt:::.submitQueryXML' instead
See help('getXML') for further details
Illumina BaseMount
BaseMount
BaseMount is used to mount your Illumina BaseSpace account so you can copy files directly from BaseSpace to Grace.
Install the BaseSpace CLI executable (bs) in your $HOME/bin directory using the following:
mkdir $HOME/bin
wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O $HOME/bin/bs
chmod u+x $HOME/bin/bs
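If $HOME/bin is not already on your PATH, add it so the bs command is found; a minimal sketch (using bs --version as a sanity check is an assumption about the CLI):
export PATH=$PATH:$HOME/bin
bs --version   # assumed flag; confirms the executable is found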
Then run the following to authenticate your BaseSpace account
bs auth
You will be prompted with an Illumina BaseSpace URL to complete the authentication at the BaseSpace website.
Once you login to the Illumina BaseSpace website and allow access, you will see a welcome message back at the terminal command line.
Example command to list your BaseSpace projects:
bs list project
Example command to see files (datasets) within each project:
bs list datasets
See the BaseSpace CLI examples for more commands, including downloading fastq files.
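As one hedged example, downloading all fastq files for a project might look like the following; the download subcommand and its flags are assumptions to verify against the BaseSpace CLI examples page:
# use the project id shown by 'bs list project'
bs download project -i <project_id> -o ./<output_dir> --extension=fastq.gz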