Ada:Fast Data Transfer
The following sections describe how to transfer massive amounts of data to and from the Ada cluster using the Xtra-Fast Data Transfer Nodes (FTN). You will need a valid account on Ada to use the FTN nodes.
- 1 Introduction
- 2 Use Policy
- 3 Access
- 4 Environment and Software
- 4.1 Data transfer using Globus Connect
- 4.2 Data transfer using gridftp
- 4.3 Data transfer using rsync
- 4.4 Data transfer using scp
Introduction
Two nodes, ada-ftn1.tamu.edu and ada-ftn2.tamu.edu, are exclusively dedicated to the fast transfer of massive amounts of data. One node is a 20-core, 128 GB memory node based on the Intel Ivy Bridge processor. The second node is an 8-core, 128 GB memory node based on the Intel Ivy Bridge processor. These nodes have 40GigE capability and are connected to Internet2. Both nodes are configured with InfiniBand (IB) access to all of the parallel (GPFS-based) file systems defined on Ada.
Acronyms used: FTN stands for fast transfer node and refers to ada-ftn1.tamu.edu and/or ada-ftn2.tamu.edu.
For additional information on file transfer options, please refer to the file transfers wiki page.
Use Policy
The FTN nodes are exclusively dedicated to the transfer of large amounts of data. For that reason, no programming environment is installed. Processes unrelated to data transfer will be terminated.
Access
You will need a valid account on Ada to access the FTN nodes, ada-ftn1.tamu.edu and ada-ftn2.tamu.edu. We are working on expanding access from hosts on Internet2. Currently, users can access the FTN nodes from the following hosts:
- hosts on TAMU campus network
- login nodes of Ada (can use local/internal hostname alias ftn1 or ftn2)
- login nodes of Terra
- login nodes of Lonestar and Stampede clusters at TACC
Environment and Software
Globus Connect and gridftp are file transfer utilities intended for large data transfers. Because they support multiple streams, they typically yield better transfer rates than rsync and scp, which use a single stream. scp and rsync are typically more suitable for small-to-medium sized file transfers. In some cases, experimentation is needed to decide which option is better.
The FTN nodes are dedicated to and configured for just that: transfers of massive amounts of data across different hosts/systems.
Comparison of data transfer software:
|Software||Multiple Streams Support||Data Channel Encryption*||Available from hosts|
|Globus Connect||Yes (via command line)||Yes, but not on by default**||access via globus.org|
|gridftp||Yes||Yes, but not on by default||FTN|
|rsync||No||Yes, if connecting via SSH||login and FTN|
|scp/sftp||No||Yes||login and FTN|
* When transferring confidential data, enable data encryption in GridFTP or use scp/rsync.
** Encryption was enabled by default for the Ada Globus Connect endpoint on Jan 14, 2021.
Data transfer using Globus Connect
Globus Connect is a reliable, high-performance file transfer platform that allows users to transfer large amounts of data seamlessly between systems or endpoints. Users can schedule transfers via a web interface and receive a notification after a transfer completes. An endpoint can be a system with Globus installed (like ada-ftn2) or a user's personal desktop. Please refer to the Globus Connect page for details.
Data transfer using gridftp
GridFTP is a high-performance data transfer protocol based on FTP and optimized for high-bandwidth wide-area networks. More information on GridFTP may be found on the Globus site. Currently, GridFTP on the Xtra-Fast Data Transfer Nodes (FTN) supports only SSH authentication.
Transferring data from FTN nodes
GridFTP transfers are launched with the globus-url-copy command. The remote system needs to support SSHFTP. Note that the Globus installations on Lonestar and Stampede do not support SSHFTP. Please see the next section on how to launch globus-url-copy from the remote system.
For example, while on an FTN node, you can transfer file1 in your scratch directory to file2 on a remote system, with 4 parallel streams:
$ globus-url-copy -p 4 -v -vb file:/scratch/user/$USER/file1 sshftp://email@example.com/remote/dir/file2
- -p 4 sets the number of parallel network streams to 4.
- -v produces verbose output.
- -vb displays the number of bytes transferred and the transfer rate per second.
From FTN, transfer file1 on a remote system to file2 in your scratch directory:
$ globus-url-copy -p 4 -v -vb sshftp://remote.system/remote/dir/file1 file:/scratch/user/$USER/file2
For long-distance data transfers, you can specify the TCP buffer size to improve throughput.
$ globus-url-copy -p 4 -tcp-bs 8M -bs 8M -vb sshftp://remote.system/remote/dir/file1 file:/scratch/user/$USER/file2
- -tcp-bs specifies the size (in bytes) of the buffer to be used by the underlying ftp data channels.
- -bs specifies the size (in bytes) of the buffer to be used by the underlying transfer methods.
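As a rough guide to picking a buffer size, a common rule of thumb (not something this page prescribes) is to size the buffer near the bandwidth-delay product of the path, divided among the parallel streams. The sketch below assumes that rule; the function name bdp_bytes and the example bandwidth, round-trip time, and stream count are hypothetical, and SRC_URL/DST_URL are placeholders. Experimentation on your own path may still be needed.

```shell
#!/bin/sh
# Hypothetical helper: estimate a per-stream buffer size in bytes from the
# bandwidth-delay product (bandwidth x round-trip time).
# Usage: bdp_bytes <bandwidth_mbit_per_s> <rtt_ms> <parallel_streams>
bdp_bytes() {
    bw_mbit=$1
    rtt_ms=$2
    streams=$3
    # Mbit/s * 1,000,000 / 8 = bytes/s; times rtt_ms / 1000 = bytes in flight.
    # That simplifies to bw_mbit * rtt_ms * 125, shared across the streams.
    echo $(( bw_mbit * rtt_ms * 125 / streams ))
}

# Example values are assumptions: a 10000 Mbit/s path, a 40 ms round-trip
# time (measure yours with ping), and 4 parallel streams.
bs=$(bdp_bytes 10000 40 4)
echo "suggested per-stream buffer size: ${bs} bytes"

# Print (do not run) a matching command line; SRC_URL and DST_URL are placeholders.
echo "globus-url-copy -p 4 -tcp-bs ${bs} -bs ${bs} -vb SRC_URL DST_URL"
```

With these example numbers the estimate is 12500000 bytes, which is close to the 12M value used in some of the commands on this page.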
Transferring data from a remote system
From the remote system, transfer file1 on that system to file2 in your scratch directory on Ada. Please use this method for Lonestar and Stampede (without the -tcp-bs and -bs options).
$ globus-url-copy -tcp-bs 12M -bs 12M -p 4 -v -vb file:/remote/dir/file1 sshftp://firstname.lastname@example.org/scratch/user/$USER/file2
Encryption and Integrity protection
The data channel is authenticated by default. Integrity protection and encryption are optional. To integrity protect the data, use the -dcsafe option. For encrypted data transfer, use the -dcpriv option.
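The options above can be combined with the stream and verbosity flags used earlier. The following is a minimal sketch of a hypothetical wrapper, guc_cmd, that only prints the resulting globus-url-copy command line so it can be inspected before running; the wrapper name, the level names none/safe/private, and the example paths are illustrative assumptions, not part of GridFTP.

```shell
#!/bin/sh
# Hypothetical wrapper: print (not run) a globus-url-copy command line with
# the requested data-channel protection.
# Usage: guc_cmd <none|safe|private> <source_url> <destination_url>
guc_cmd() {
    case "$1" in
        none)    prot="" ;;          # authentication only (the default)
        safe)    prot="-dcsafe" ;;   # integrity protection
        private) prot="-dcpriv" ;;   # encryption
        *) echo "unknown protection level: $1" >&2; return 1 ;;
    esac
    # $prot is left unquoted on purpose so an empty value adds no argument.
    echo globus-url-copy -p 4 -vb $prot "$2" "$3"
}

# Example with placeholder URLs: an encrypted transfer from scratch to a remote system.
guc_cmd private file:/scratch/user/userid/file1 sshftp://remote.system/remote/dir/file2
```

Copy the printed line into your shell on an FTN node once the source and destination look right.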
Special note for transfer to/from TACC using GridFTP
Lonestar and Stampede at TACC do not support GridFTP authentication via SSH. To transfer data to/from TACC, please log on to Lonestar or Stampede, load the module "GLOBUS-5.0", and then run the following command to transfer data from TACC to Ada:
globus-url-copy -vb -p 4 file:/path/to/your_file sshftp://email@example.com/scratch/user/userid/
To transfer data from Ada to TACC, run:
globus-url-copy -vb -p 4 sshftp://firstname.lastname@example.org/scratch/user/userid/your_file file:/path/to/work_dir/
Data transfer using rsync
rsync is a fast, versatile, remote (and local) file-copying tool. It is recommended when relatively few differences exist between the target and source versions, because rsync copies only the differences in files that have actually changed. By default, rsync uses the SSH remote shell.
To bring localdir (the target area) in sync with remotedir (given as a relative path) on the remote server, issue the following:
rsync -av [-z] userid@remotesystem:remotedir/ localdir/
To sync data in localdir to remotedir (given as an absolute path) on the remote server:
rsync -av [-z] localdir/ userid@remotesystem:/path/to/remotedir/
- -av enables the "archive" and verbose modes.
- -z enables compression.
Compression does not always yield a better transfer rate, especially when the transferring host is overloaded. Over a slow link, however, compression often does yield better transfer rates.
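Before starting a large sync, it can help to preview what rsync would copy. The sketch below assumes rsync's standard --dry-run and --out-format options; the function name sync_preview is hypothetical, and userid, remotesystem, remotedir, and localdir are the same placeholders used in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: list the files rsync would transfer from a source
# directory to a target directory, without copying anything.
# Usage: sync_preview <source_dir> <target_dir>
sync_preview() {
    # -a          archive mode, as in the commands above
    # --dry-run   report what would be done, but change nothing
    # --out-format='%n' prints one item name per line
    rsync -a --dry-run --out-format='%n' "$1"/ "$2"/
}

# Example (placeholders, not run here):
#   sync_preview userid@remotesystem:remotedir localdir
```

Only files that differ between the two sides should be listed, so a short listing suggests rsync is a good fit for that transfer.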
Data transfer using scp
To transfer data from a remote system:
scp userid@remotesystem:remotefile localdir/
To transfer data to a remote system:
scp localfile userid@remotesystem:/path/to/remotedir/
To copy a remote directory recursively, add the "-r" option:
scp -r userid@remotesystem:remotedir/ localdir/