Revision as of 10:20, 31 January 2019
File Transfer on HPRC Clusters
File Transfer Software
Several software options are available for transferring files to and from HPRC clusters. The best choice depends on factors such as data size, location, and transfer frequency. If the data set is small (transfer time under an hour), pick whichever tool is most convenient or familiar to you.
Globus Connect is a reliable, high-performance file transfer platform that lets users move large amounts of data seamlessly between systems, called endpoints. Users can schedule a transfer through the web interface on globus.org and receive a notification when the transfer completes. An endpoint can be a system with Globus installed (like ada-ftn1) or a user's personal desktop.
What Globus Connect is good at
- transferring large amounts of data (say 100+ GB)
- speed: it uses up to 4 data streams, so a transfer can run as fast as the slowest link between your server/desktop/laptop and the HPRC fast transfer nodes
- transferring files between two endpoints (for example, between Ada and Terra, or between Ada and an endpoint on your laptop)
- personal endpoints work behind NAT (Network Address Translation; for example, your desktop behind a wifi router at home)
- resuming failed transfers
- sending a notification after a scheduled transfer completes
What Globus Connect is not good at
- your server or desktop/laptop must have the Globus Connect software installed and set up as an endpoint
- by default, data stream is not encrypted
How do I use Globus Connect
- visit the Globus Connect wiki page for more information
- use endpoints "TAMU ada-ftn1" or "TAMU ada-ftn2" for the Ada/Curie cluster and "TAMU terra-ftn" for the Terra cluster
SCP and SFTP protocols are a means of securely transferring computer files between a local host and a remote host.
What SCP/SFTP is good at
- ubiquitous; simple to use
- sftp offers an interactive interface (command line) to download/upload files
- data stream is encrypted
What SCP/SFTP is not good at
- not very fast (file transfer only uses one data stream over SSH protocol)
How do I use SCP/SFTP
- issue scp/sftp commands from the command line on Linux, Mac, or a MobaXterm terminal
- use WinSCP on Windows, FileZilla (with the SFTP protocol) on Windows or Mac, or the File Transfer panel in MobaXterm
- if your transfer to the Ada or Terra login nodes (ada.tamu.edu or terra.tamu.edu) is terminated after one hour, use "ada-ftn1.tamu.edu" or "ada-ftn2.tamu.edu" for Ada/Curie and "terra-ftn.hprc.tamu.edu" for Terra instead; the Ada/Terra login nodes enforce a one-hour CPU limit on all user processes
- check the scp and sftp man pages for options and examples
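As a rough illustration of the command-line route above, the following sketch builds an scp upload command aimed at an Ada fast transfer node. The user name "netid", the file name, and the remote directory are hypothetical placeholders, not values from this page; only the host name comes from the list above.

```python
import shlex

def scp_upload_command(local_path, user, remote_dir,
                       host="ada-ftn1.tamu.edu", recursive=False):
    """Build an scp command that uploads local_path to remote_dir on host."""
    cmd = ["scp"]
    if recursive:
        cmd.append("-r")  # needed when local_path is a directory
    cmd += [local_path, f"{user}@{host}:{remote_dir}"]
    return cmd

# Hypothetical user name and paths, for illustration only.
cmd = scp_upload_command("results.tar.gz", "netid", "/scratch/user/netid/")
print(shlex.join(cmd))
```

The printed line can be pasted into a terminal, or the list can be handed to subprocess.run(cmd, check=True) to run the transfer from Python.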
rsync is a fast, versatile, remote (and local) file-copying tool. It is recommended when relatively few differences exist between the source and target versions, because rsync transfers only the parts of files that have actually changed. By default, rsync uses SSH as its remote shell.
What rsync is good at
- resuming a partially transferred file
- synchronizing files and directories between two locations (local-local, local-remote, remote-local)
- by default, transfer is over SSH protocol, so data stream is encrypted
What rsync is not good at
- by default, files are transferred over SSH, which uses only one data stream and is not very fast
- the compression option, "-z", might not shorten the transfer time
How do I use rsync
- issue rsync commands from the command line on Linux, Mac, or a MobaXterm terminal
- use DeltaCopy or Grsync on Windows
- use cwRsync 5.4.1 for the command line on Windows
- check the rsync man page for options and examples
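A minimal sketch of an rsync invocation for the command-line route above, assuming a hypothetical user "netid" and hypothetical source and destination paths. The flags are standard rsync options: "-a" (archive mode), "-v" (verbose), "--partial" (keep partially transferred files so a rerun can resume them), and "--progress".

```python
import shlex

def rsync_command(source, dest, compress=False):
    """Build an rsync command that syncs source to dest over SSH (the default)."""
    cmd = ["rsync", "-av", "--partial", "--progress"]
    if compress:
        cmd.append("-z")  # compression; may not shorten the transfer time
    cmd += [source, dest]
    return cmd

# Hypothetical user name and paths. The trailing slash on "data/" tells rsync
# to copy the directory's contents rather than the directory itself.
cmd = rsync_command("data/", "netid@terra-ftn.hprc.tamu.edu:/scratch/user/netid/data/")
print(shlex.join(cmd))
```

Running the same command again after an interruption transfers only what is still missing or changed, which is what makes rsync suitable for resuming and synchronizing.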
rclone is a tool for syncing files from HPRC systems to remote storage sites like Google Drive, Dropbox, Amazon's AWS and many more.
What rclone is good at
- copy data to or from cloud (Google Drive, Dropbox, AWS, etc)
What rclone is not good at
- transfer can be slow
How do I use rclone
- rclone is available on Ada, Terra, and the HPRC Lab workstations; no module is required on any of them
- see the rclone wiki page for instructions and examples
TAMU OnDemand, or the portal, is a web platform through which users can access HPRC clusters and services with a web browser. You can download and upload files via the "Files" menu.
What portal is good at
- web interface and simple to use
- you can view content (text, image, movie) via web browser
What portal is not good at
- transfers one file at a time
- transfer can be slow (files are transferred via a single data stream)
How do I use portal
- visit https://portal.hprc.tamu.edu (currently for Ada cluster only; portal for Terra cluster coming soon)
- check the portal wiki page for additional info
- I am off campus or traveling abroad. Can I transfer files to/from an HPRC cluster?
All Ada/Terra login nodes are behind the TAMU campus firewall, so TAMU VPN access is required if you are off campus. If you have a personal Globus Connect endpoint set up on your laptop, you can transfer files to/from your laptop via globus.org using Globus Connect, without the TAMU VPN.
- Should I use FTN or regular Ada/Terra login nodes to transfer files?
When you connect to ada.tamu.edu, terra.tamu.edu or curie.tamu.edu, you are connected to one of the Ada/Terra/Curie login nodes. These login nodes have a 1 Gbps link to campus and sit behind the TAMU campus firewall, so they are reachable only from campus or over the TAMU VPN. Ada FTN (ada-ftn1.tamu.edu and ada-ftn2.tamu.edu, with a 40 Gbps link) and Terra FTN (terra-ftn.hprc.tamu.edu, with a 10 Gbps link) are outside the campus firewall but have a limited presence on the internet (SSH is accessible only from campus and TACC; Globus Connect is accessible from anywhere).
For a short file transfer (less than one hour), either one works. For transfers longer than one hour, please use an FTN node (ada-ftn1.tamu.edu and ada-ftn2.tamu.edu for the Ada/Curie cluster; terra-ftn.hprc.tamu.edu for the Terra cluster), which does not have the one-hour CPU time limit on processes.
If your desktop is connected to a 1 Gbps link on campus, you might notice a faster transfer rate to the Ada/Terra login nodes than to FTN. FTN nodes are outside the TAMU campus firewall, so traffic from campus to FTN goes through the firewall, which adds delay to the transfer.
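The rule of thumb above can be sketched as a small helper. This is only an illustration under stated assumptions: it treats the expected transfer duration as a conservative stand-in for the one-hour CPU limit, and it prefers the login node for short transfers even though either host would work. The function and dictionary names are made up for this example; the host names are the ones listed above.

```python
# Host names taken from this page; the names of these dictionaries are illustrative.
LOGIN_NODES = {"ada": "ada.tamu.edu", "terra": "terra.tamu.edu"}
FTN_NODES = {"ada": "ada-ftn1.tamu.edu", "terra": "terra-ftn.hprc.tamu.edu"}
CPU_LIMIT_MINUTES = 60  # login nodes limit user processes to one hour of CPU time

def pick_transfer_host(cluster, expected_minutes):
    """Return an FTN host for transfers expected to reach the one-hour limit."""
    if expected_minutes >= CPU_LIMIT_MINUTES:
        return FTN_NODES[cluster]
    # Assumption: for short transfers either host works; choose the login node.
    return LOGIN_NODES[cluster]

print(pick_transfer_host("ada", 30))
print(pick_transfer_host("terra", 120))
```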
- Can I download files from the internet (say NIH) inside a job script?
Ada/Terra compute nodes do not have access to the internet, so you cannot download files from the internet on a compute node. Please download the necessary files on the Ada/Terra login nodes or FTN nodes.
- I want to back up files to my desktop/laptop.
"rsync" or "rclone" (to cloud storage) is probably the best choice.
- I have 100+ TB data to transfer.
Please contact us at firstname.lastname@example.org. We would like to get more information first and see how we can support this.
- Why does the file transfer take so long?
This is a complicated question. The file transfer time depends largely on the data size, the bandwidth of the slowest link between the HPRC cluster and your desktop (the bottleneck link), how congested the network is, and how fast the file systems at each end can read and write. Transferring 100 gigabytes of data over a 100 Mbps link takes about 170 minutes in the best case (80% efficiency, no cross traffic, and no I/O bottleneck). Often the network link to your desktop/laptop (the last mile) and the storage on your desktop/laptop are the slowest parts of the entire file transfer.
The number of data transfer streams makes a difference as well. Globus Connect uses up to 4 data streams to shorten the transfer time (more noticeable for large files; a 2.5x speedup is typical).
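The best-case estimate for 100 GB over a 100 Mbps link can be reproduced with a short calculation. This sketch assumes decimal units (1 GB = 8000 megabits) and the 80% link efficiency quoted above; the function name is illustrative, and the model ignores congestion and disk I/O limits.

```python
def transfer_minutes(size_gb, link_mbps, efficiency=0.8):
    """Best-case transfer time in minutes for size_gb gigabytes over link_mbps."""
    size_megabits = size_gb * 1000 * 8         # 1 GB = 1000 MB = 8000 megabits
    effective_mbps = link_mbps * efficiency    # usable share of the link
    return size_megabits / effective_mbps / 60

slow = transfer_minutes(100, 100)    # 100 GB over a 100 Mbps link
fast = transfer_minutes(100, 1000)   # same data over a 1 Gbps link
print(f"{slow:.0f} min at 100 Mbps, {fast:.0f} min at 1 Gbps")
```

The 100 Mbps case works out to roughly 167 minutes, in line with the figure of about 170 minutes given above.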