Aspera (SRA, 1000genomes, BioMart, Illumina BaseMount)
Aspera
Install Aspera
SRA-Toolkit will check whether you have Aspera installed.
The Aspera ascp command downloads SRA files faster than wget.
Run the following command from any directory. This will install configuration files in your ~/.aspera directory.
/scratch/data/bio/bin/ibm-aspera-connect_4.0.2.38_linux.sh
Then add the following to your PATH in your ~/.bash_profile file
PATH=$PATH:$HOME/.aspera/connect/bin
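After updating your PATH, you can confirm that the shell finds ascp; a minimal check (the -A flag, which prints version and license information, is an assumption to verify against your ascp build):
source ~/.bash_profile
which ascp
ascp -A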
Example command: ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/ERR315/009/ERR3155119/ERR3155119.fastq.gz ./
Downloading 1000 Genomes data
Log in to a Grace data transfer node from your desktop
ssh netid@grace-dtn1.hprc.tamu.edu
Sample command to download a fastq.gz file
ascp -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -QTr -l10000m \
anonftp@ftp-trace.ncbi.nih.gov:/1000genomes/ftp/phase3/data/NA21087/sequence_read/SRR442587_1.filt.fastq.gz ./
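If the run is paired-end, the second read file usually sits alongside the first; a sketch of the same command for the mate file (the _2 file name is an assumption based on the 1000 Genomes naming convention):
ascp -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -QTr -l10000m \
anonftp@ftp-trace.ncbi.nih.gov:/1000genomes/ftp/phase3/data/NA21087/sequence_read/SRR442587_2.filt.fastq.gz ./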
Uploading to SRA
Log in to a fast transfer node from a Grace login node. The Grace data transfer nodes are grace-dtn1.hprc.tamu.edu and grace-dtn2.hprc.tamu.edu, e.g.:
ssh netid@grace-dtn1.hprc.tamu.edu
Sample command to upload to SRA
time ascp -i <path/to/ncbi_key_file> -QT -l10000m -k1 -d <path/to/files/directory/> \
subasp@upload.ncbi.nlm.nih.gov:uploads/<ncbi_account_email>_<random_code>/<submission_folder>/
The < and > characters are not part of the command; they only mark the values you need to provide.
The key file is provided by NCBI and must be given as an absolute path, e.g.: /home/<username>/keys/aspera.openssh
The random code for the upload is provided by NCBI.
The submission folder is required and will be created automatically by the ascp command.
Note the use of the time command before the ascp command. If the command stops at 60 minutes, your upload did not complete.
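Because the command above includes -k1 (a resume level), re-running the identical ascp command after an interrupted upload should resume the transfer rather than start over; a sketch, assuming the same paths as above:
# re-run the identical command; -k1 lets ascp skip data that already transferred completely
time ascp -i <path/to/ncbi_key_file> -QT -l10000m -k1 -d <path/to/files/directory/> \
subasp@upload.ncbi.nlm.nih.gov:uploads/<ncbi_account_email>_<random_code>/<submission_folder>/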
SRA-toolkit
Used to download Sequence Read Archive (SRA) files and extract them into fastq file(s).
# on Grace
module load GCC/10.2.0 OpenMPI/4.0.5 SRA-Toolkit/2.10.9
SRA-Toolkit will download files to your home directory by default. Since your home directory is limited to 10GB, you can redirect the downloads to your scratch space by creating a directory in scratch and making a symbolic link to that directory from your home directory.
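For older SRA-Toolkit versions, the symbolic-link approach looks like the following sketch; the $HOME/ncbi directory name is an assumption about where older toolkit versions write by default:
# create the download directory in scratch and link it from your home directory
mkdir $SCRATCH/ncbi
ln -s $SCRATCH/ncbi $HOME/ncbi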
For newer versions of SRA-Toolkit, this is done using the vdb-config command to configure the cache directory to a directory in your $SCRATCH space. You only need to run vdb-config once to set up the cache directory. Do the following at a Grace login node command prompt prior to submitting a job script.
mkdir /scratch/user/YOUR_NETID/ncbi
# on Grace
module load GCC/10.2.0 OpenMPI/4.0.5 SRA-Toolkit/2.10.9
vdb-config --interactive
# use letter and tab keys or mouse clicks to select menu items
- type c for CACHE
- type o for choose
- select [ Create Dir ], hit enter, and type /scratch/user/YOUR_NETID/ncbi
- select OK, hit enter, then hit y to answer yes when asked to change the location
- type s to save and x to exit
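As an alternative to the interactive menu, the same setting can usually be made non-interactively; a sketch, assuming the cache location lives under the /repository/user/main/public/root configuration node (verify the result with vdb-config --interactive afterwards):
vdb-config -s /repository/user/main/public/root=/scratch/user/YOUR_NETID/ncbi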
# when you are done downloading and processing sra files, you will need to remove downloaded .sra files
# from the directory /scratch/user/your_netid/ncbi/sra/
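When you are ready to clean up, the removal might look like this (a sketch; list the directory first to confirm you are deleting only cached .sra files):
ls /scratch/user/YOUR_NETID/ncbi/sra/
rm /scratch/user/YOUR_NETID/ncbi/sra/*.sra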
The compute nodes are not connected to the internet, so you will need to add the web proxy lines after loading the SRA-Toolkit module in your job script.
With the web proxy lines added to a job script, you can prefetch a .sra file and then use fastq-dump to process it. Downloading the .sra file to $TMPDIR is a good approach since $TMPDIR is deleted after the job completes. You just need to specify an output directory for the processed fastq files.
#!/bin/bash
#SBATCH --export=NONE # do not export current env to the job
#SBATCH --job-name=fastq-dump # job name
#SBATCH --time=1-00:00:00 # max job run time dd-hh:mm:ss
#SBATCH --ntasks-per-node=1 # tasks (commands) per compute node
#SBATCH --cpus-per-task=2 # CPUs (threads) per command
#SBATCH --mem=10G # total memory per node
#SBATCH --output=stdout.%x.%j # save stdout to file
#SBATCH --error=stderr.%x.%j # save stderr to file
# on Grace
module load GCC/10.2.0 OpenMPI/4.0.5 SRA-Toolkit/2.10.9
# enable proxy to allow compute node connection to internet
module load WebProxy
prefetch --output-directory $TMPDIR SRR575500 && \
fastq-dump --outdir seqs -F -I --gzip $TMPDIR/SRR575500/SRR575500.sra
- add the --split-files option to the fastq-dump command for paired-end reads, as shown in the sketch below
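For example, a paired-end version of the job script's download-and-dump step might look like this (a sketch; whether SRR575500 actually contains paired reads is not verified here):
prefetch --output-directory $TMPDIR SRR575500 && \
fastq-dump --outdir seqs --split-files -F -I --gzip $TMPDIR/SRR575500/SRR575500.sra
# with --split-files, paired reads are written as seqs/SRR575500_1.fastq.gz and seqs/SRR575500_2.fastq.gz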
SRA-toolkit (fastq-dump) is also available in Maroon Galaxy.
Browse SRA using SRA Explorer, where the 'saved datasets' feature provides URLs so you can download fastq files directly with wget instead of having to use SRA-Toolkit.
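A wget download using an SRA Explorer URL might look like the following; the URL here is an assumption that reuses the EBI fastq path from the ascp example above, while real URLs come from your SRA Explorer 'saved datasets' list:
wget https://ftp.sra.ebi.ac.uk/vol1/fastq/ERR315/009/ERR3155119/ERR3155119.fastq.gz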
gsutil
gsutil is a Python application that lets you access Cloud Storage from the command line.
gsutil homepage
1. Go to https://cloud.google.com to register for Google Cloud.
- After you register, you will receive $300 in Free Trial credits (the Free Trial period is 3 months).
- You're also eligible for an additional $100.00 in Free Trial credits for a total of $400.00. You'll receive these credits within 24 hours of completing signup.
2. Create a Google Cloud project, to which you will link a payment method in the next step.
- For this tutorial, the Google Cloud project is named my_sra_downloads with the auto-generated id my_sra_downloads-355918.
3. Go to the Billing tab in the left panel of your cloud.google.com dashboard and enable billing by linking your account to a credit card.
- Your card does not get charged during the Free Trial period, and auto-billing after the Free Trial is disabled.
4. Log in to an HPRC cluster and load the gsutil module.
Grace:
module load GCCcore/11.2.0 gsutil/5.10
5. Configure gsutil to allow read/write access.
gsutil config -f
- Copy the URL in the output of the gsutil command and paste it into your web browser.
- Copy the code from the web browser output, paste it in the terminal, and hit enter.
6. Copy a file from the Google cloud into the current directory.
gsutil -u <project_id> cp <bucket URL> <output_file_name>
- Example command for downloading one file, using project_id my_sra_downloads-355918, for the sra file gs://sra-pub-src-18/SRR13606306/E_S1_L001_I1_001.fastq.gz.1 and naming it E_S1_L001_I1_001.fastq.gz:
gsutil -u my_sra_downloads-355918 cp gs://sra-pub-src-18/SRR13606306/E_S1_L001_I1_001.fastq.gz.1 E_S1_L001_I1_001.fastq.gz
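To find file names before copying, you can list the bucket contents; a sketch using the same requester-pays project and bucket as the example above:
gsutil -u my_sra_downloads-355918 ls gs://sra-pub-src-18/SRR13606306/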
BioMart
biomaRt
The biomaRt package provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way, without the need to know the underlying database schemas or write complex SQL queries.
biomaRt user guide
RStudio on the HPRC portal cannot be used for biomaRt since RStudio runs on a compute node and the compute nodes do not have internet access.
Once you have saved your XML query from a web-based BioMart interface to a text file, use the following example to retrieve the query results from biomart.org.
Read your BioMart XML query file into R, retrieve sequences, and print the output to a file:
Example of XML query saved to a text file (query.xml for example):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName = "default" formatter = "FASTA" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >
<Dataset name = "mmusculus_gene_ensembl" interface = "default" >
<Filter name = "start" value = "1"/>
<Filter name = "chromosome_name" value = "18"/>
<Filter name = "end" value = "5000000"/>
<Attribute name = "peptide" />
<Attribute name = "ensembl_gene_id" />
<Attribute name = "ensembl_transcript_id" />
</Dataset>
</Query>
The biomaRt R package can be accessed using the R_tamu module:
module load R_tamu/3.5.0-iomkl-2017b-recommended-mt
Start the R command line
R
Then on the R command line load the biomaRt library
library(biomaRt)
R commands to retrieve fasta sequences using the query.xml file and save results to a file (martout.fasta in this example):
# read the XML query file into a character vector, one element per line
myxml<-readLines('query.xml')
# collapse the lines into a single string
mytxt<-paste(unlist(myxml), collapse='')
# submit the XML query to biomart.org and retrieve the sequences
myseqs<-getXML(xmlquery=mytxt)
# write the fasta-formatted results to a file
write.table(myseqs, 'martout.fasta', row.names=F, col.names=F, quote=F)
There is a warning message which can be ignored:
Warning message:
Function 'getXML()' is deprecated.
Use 'biomaRt:::.submitQueryXML' instead
See help('getXML') for further details
Illumina BaseMount
BaseMount
BaseMount is used to mount your Illumina BaseSpace account so you can copy files directly from BaseSpace to Grace.
Install the BaseSpace CLI executable (bs) in your $HOME/bin directory using the following:
mkdir $HOME/bin
wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O $HOME/bin/bs
chmod u+x $HOME/bin/bs
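If $HOME/bin is not already on your PATH, add it so the bs command is found; a minimal sketch (using bs --version as a sanity check is an assumption about the CLI):
export PATH=$PATH:$HOME/bin
bs --version   # assumed flag; confirms the executable is found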
Then run the following to authenticate your BaseSpace account
bs auth
You will be prompted with an Illumina BaseSpace URL to complete the authentication at the BaseSpace website.
Once you login to the Illumina BaseSpace website and allow access, you will see a welcome message back at the terminal command line.
Example command to list your BaseSpace projects:
bs list project
Example command to see files (datasets) within each project:
bs list datasets
See the BaseSpace CLI examples for more commands, including downloading fastq files.
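As one hedged example, downloading all fastq files for a project might look like the following; the download subcommand and its flags are assumptions to verify against the BaseSpace CLI examples page:
# use the project id shown by 'bs list project'
bs download project -i <project_id> -o ./<output_dir> --extension=fastq.gz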