Data Transfer Guide
- Data Transfer Speed
- Using the RDF
- Data Transfer via SSH
- Support
This page gives an overview of the different mechanisms for transferring data to and from ARCHER, the UK-RDF and remote machines.
Data Transfer Speed
Data transfer speed may be limited by many different factors so the best data transfer mechanism to use depends on the type of data being transferred and where the data is going.
- Disk speed - The ARCHER /work file-systems and the RDF file-systems are highly parallel consisting of a very large number of high performance disk drives. This allows them to support a very high data bandwidth. Unless the remote system has a similar parallel file-system you may find your transfer speed limited by disk performance.
- Meta-data performance - Meta-data operations such as opening and closing files or listing the owner or size of a file are much less parallel than read/write operations. If your data consists of a very large number of small files you may find your transfer speed is limited by meta-data operations. Meta-data operations performed by other users of the system interact strongly with those you perform, so reducing the number of such operations you issue may reduce variability in your IO timings.
- Network speed - Data transfer performance can be limited by network speed. More importantly, it is limited by the slowest section of the network between source and destination.
- Fire-wall speed - Most modern networks are protected by some form of fire-wall that filters out malicious traffic. This filtering has some overhead and can result in a reduction in data transfer performance. The needs of a general purpose network that hosts email/web-servers and desktop machines are quite different from a research network that needs to support high volume data transfers. If you are trying to transfer data to or from a host on a general purpose network you may find the fire-wall for that network will limit the transfer rate you can achieve.
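As a rough illustration of the meta-data point above, the following sketch counts the files in a directory and works out their average size, which can help decide whether a data set is dominated by small files. The sample directory, the variable names and the 1 MiB threshold are all assumptions made for the example, not site recommendations.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical sample data so the sketch runs anywhere;
# point "target" at your own directory instead.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
target="$workdir/mydata"
mkdir -p "$target/sub"
for i in 1 2 3 4 5; do
    printf 'sample %s\n' "$i" > "$target/sub/file$i.txt"
done

# Count regular files and sum their sizes in bytes (field 5 of "ls -ln").
file_count=$(find "$target" -type f | wc -l | tr -d ' ')
total_bytes=$(find "$target" -type f -exec ls -ln {} + | awk '{s += $5} END {print s + 0}')
avg_bytes=$(( file_count > 0 ? total_bytes / file_count : 0 ))

# Assumed cut-off for "small" files; adjust to taste.
threshold=1048576
if [ "$file_count" -gt 1 ] && [ "$avg_bytes" -lt "$threshold" ]; then
    advice="archive first"
else
    advice="transfer as-is"
fi

echo "$file_count files, $total_bytes bytes, average $avg_bytes bytes: $advice"
```

A data set with many files and a small average size is the case where packing the files into a single archive before transfer is most likely to help.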
Using the RDF
The Research Data Facility (RDF) consists of 7.8 PB of disk, with an additional 19.5 PB of backup tape capacity. The RDF is external to the national services, and is designed as long term data storage. The RDF file-systems are directly mounted on the ARCHER login nodes and the nodes used to run serial batch jobs. These file-systems are not visible from the compute nodes. The RDF has 3 file-systems:
- /general
- /epsrc
- /nerc - no longer available
The file-system a user has access to depends on their funding body.
Archiving
If you have related data that consists of a large number of small files it is strongly recommended to pack the files into a larger "archive" file for long term storage. A single large file makes more efficient use of the file-system and is easier to move, copy and transfer because significantly fewer meta-data operations are required. Archive files can be created using tools like tar, cpio and zip. When using these commands to prepare a file for the RDF, it is good practice to forgo compression as this will slow the archiving process.
tar command
The tar command packs files into a "tape archive" format intended for backup purposes. The command has general form:
tar [options] [file(s)]
Common options include -c "create a new archive", -v "verbosely list files processed", -W "verify the archive after writing", -l "confirm all file hard links are included in the archive", and -f "use an archive file" (for historical reasons, tar writes its output to stdout by default rather than a file). Putting these together:
tar -cvWlf mydata.tar mydata
will create and verify an archive ready for the RDF. Further information on the hard link check can be found in the tar manual.
To extract files from a tar file, the option -x is used. For example:
tar -xf mydata.tar
will recover the contents of "mydata.tar" to the current working directory.
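Before extracting, it can be useful to list what an archive contains and to unpack it somewhere other than the current directory. The sketch below uses the standard tar options -t (list) and -C (change to a directory before extracting). The scratch directory and file names are hypothetical and exist only so the example is self-contained.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch area with a small sample archive.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"
mkdir mydata
echo "hello" > mydata/a.txt
tar -cf mydata.tar mydata

# List the archive contents without extracting anything.
listing=$(tar -tf mydata.tar)
echo "$listing"

# Extract into a separate directory rather than the working directory.
mkdir restore
tar -xf mydata.tar -C restore
restored=$(cat restore/mydata/a.txt)
echo "restored content: $restored"
```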
To verify an existing tar file against a set of data, the -d "diff" option can be used. By default, no output will be given if a verification succeeds and an example of a failed verification follows:
$> tar -df mydata.tar mydata/*
mydata/damaged_file: Mod time differs
mydata/damaged_file: Size differs
Note that tar files do not store checksums with their data, requiring the original data to be present during verification.
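One possible way around the lack of stored checksums is to record a checksum of the archive itself in a small side file and keep the two together. This is a minimal sketch, assuming the GNU coreutils sha256sum tool is available. The file names are hypothetical, and the damage is simulated by appending a byte to the archive.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch area with a small sample archive.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"
mkdir mydata
echo "results" > mydata/out.dat
tar -cf mydata.tar mydata

# Record a checksum next to the archive; copy both files together.
sha256sum mydata.tar > mydata.tar.sha256

# Later (for example after a transfer) verify without the original data.
if sha256sum -c mydata.tar.sha256 > /dev/null 2>&1; then
    intact=yes
else
    intact=no
fi

# Simulate corruption by changing the archive, then check again.
printf 'x' >> mydata.tar
if sha256sum -c mydata.tar.sha256 > /dev/null 2>&1; then
    damage_detected=no
else
    damage_detected=yes
fi

echo "intact before damage: $intact, damage detected: $damage_detected"
```

This checks the archive file as a whole; it does not identify which member file was affected.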
cpio command
The cpio utility is a common file archiver and is provided by most Linux distributions. The command has form:
cpio [options] < in > out
Note cpio uses stdin and stdout for its input and output functionality. The utility does not provide a "recursive" flag like tar and zip and is hence often used with the find command when working with directories.
Common options include -o "create an archive (copy-out mode)", -v "verbose mode", and -H "use the given archive format". The recommended format is crc as this provides checksum support at the cost of compatibility with older versions of cpio. Together:
find mydata/ | cpio -ovH crc > mydata.cpio
will create an archive ready for the RDF.
Extraction is performed via the -i "copy-in" flag usually paired with -d to ensure directories are created as needed. For example:
cpio -id < mydata.cpio
recovers the contents of the archive to the working directory.
Archive verification can be performed in -i mode with the --only-verify-crc flag set. As the name implies, this skips the file extraction and only verifies the checksum for each file in the archive. An example of this on a damaged archive follows:
$> cpio -i --only-verify-crc < mydata.cpio
cpio: mydata/file: checksum error (0x1cd3cee8, should be 0x1cd3cf8f)
204801 blocks
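The cpio steps above can be tried end to end in a scratch directory, as in the sketch below. It assumes GNU cpio, since --only-verify-crc may not be recognised by other implementations, and skips the demonstration if GNU cpio is not found. The directory and file names are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch area with a small sample directory.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"
mkdir mydata
echo "payload" > mydata/file

cpio_status="skipped"
# Only run when GNU cpio is installed (assumption: --only-verify-crc is GNU-specific).
if command -v cpio > /dev/null 2>&1 && cpio --version 2> /dev/null | grep -q 'GNU'; then
    # Create the archive in crc format.
    find mydata | cpio -o -H crc 2> /dev/null > mydata.cpio

    # Verify the stored checksums without extracting.
    if verify_out=$(cpio -i --only-verify-crc < mydata.cpio 2>&1); then
        case "$verify_out" in
            *"checksum error"*) cpio_status="corrupt" ;;
            *) cpio_status="verified" ;;
        esac
    else
        cpio_status="verify failed"
    fi

    # Extract into a separate directory, creating directories as needed.
    mkdir restore
    (cd restore && cpio -id < ../mydata.cpio 2> /dev/null)
fi

echo "cpio round trip: $cpio_status"
```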
zip command
The zip file format is widely used for archiving files and is supported by most major operating systems. The utility to create zip files can be run from the command line as:
zip [options] mydata.zip [file(s)]
Common options are -r used to zip up a directory and -# where "#" represents a digit ranging from 0 to 9 to specify compression level, 0 being the least and 9 the most. Default compression is -6 but we recommend using -0 to speed up the archiving process. Together:
zip -0r mydata.zip mydata
will create an archive ready for the RDF. Note: Unlike tar and cpio, zip files do not preserve hard links. File data will be copied on archive creation, e.g. an uncompressed zip archive of a 100MB file and a hard link to that file will be approximately 200MB in size. This makes zip an unsuitable format if you wish to precisely reproduce the file system.
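The hard link behaviour can be checked on a small scale. The sketch below creates a 1 MiB file and a hard link to it, archives both with tar, and, if zip happens to be installed, with zip at compression level 0 for comparison. The file names are hypothetical, and exact archive sizes will vary with tool version and padding.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch area: one 1 MiB file plus a hard link to it.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"
mkdir mydata
head -c 1048576 /dev/zero > mydata/data.bin
ln mydata/data.bin mydata/data.link

# tar records the second name as a link, so the data is stored once.
tar -cf mydata.tar mydata
tar_bytes=$(wc -c < mydata.tar | tr -d ' ')
echo "tar archive: $tar_bytes bytes"

# zip stores the data once per name; only attempted if zip is available.
zip_bytes=""
if command -v zip > /dev/null 2>&1; then
    zip -0 -r -q mydata.zip mydata
    zip_bytes=$(wc -c < mydata.zip | tr -d ' ')
    echo "zip archive: $zip_bytes bytes"
fi
```

The tar archive should be only slightly larger than one copy of the data, while the uncompressed zip archive should be roughly twice that size.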
The corresponding unzip command is used to extract data from the archive. The simplest use case is:
unzip mydata.zip
which recovers the contents of the archive to the working directory.
Files in a zip archive are stored with a CRC checksum to help detect data loss. unzip provides options for verifying this checksum against the stored files. The relevant flag is -t and is used as follows:
$> unzip -t mydata.zip
Archive:  mydata.zip
    testing: mydata/                  OK
    testing: mydata/file              OK
No errors detected in compressed data of mydata.zip.
Local Copy from ARCHER
Because the RDF file-systems are directly mounted on the ARCHER login nodes, standard commands such as cp and rsync can be used to copy files across from the /home and /work file-systems. You should use these commands rather than network transfer tools, as local copies are usually faster.
cp command
The cp command creates a copy of a file, or if given the -r flag a directory, at the given destination. It can be run from the command line as follows:
cp [options] source destination
However, if you are transferring a large amount of data, you may wish to use the serial nodes on ARCHER. In this case you should use a submission script, for example:
#!/bin/bash --login
#
#PBS -l select=serial=true:ncpus=1
#PBS -l walltime=00:20:00
#PBS -A [budget]

cd $PBS_O_WORKDIR

cp [-r] source destination
In the above script 'source' should be the absolute path of the file/directory being copied or the script should be stored in and submitted from the directory containing the source file/directory.
If you want the batch job to run after another batch job has completed, for example to move the results generated by a parallel job, you can do this by specifying a dependency in the qsub flags:
$ qsub -W depend=afterok:previous-job-id copyscript.pbs
You should not use the mv command to move data between file-systems. Within a single file-system this command is very fast as it just renames the file or directory. When moving between file-systems it is equivalent to copies followed by deletes. There is therefore absolutely no speed advantage and it is much safer to perform the delete later once you are sure the data has been copied correctly.
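One way to be sure the data has been copied correctly before deleting the original is to compare checksums of the two directory trees. The following is a minimal sketch of that idea, assuming sha256sum is available. The paths, the sample files and the manifest helper function are hypothetical, and the delete is left commented out.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical source and destination; in practice these would be
# directories on /work and on the RDF.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
src="$workdir/results"
dest="$workdir/rdf_copy"
mkdir -p "$src/run1"
echo "energy 1.0" > "$src/run1/out.log"
echo "energy 2.0" > "$src/summary.txt"

# Copy first, delete later.
cp -r "$src" "$dest"

# Helper (hypothetical): checksum every file, with paths relative to
# the tree root, sorted by path so two trees can be compared directly.
manifest() {
    (cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2)
}

if [ "$(manifest "$src")" = "$(manifest "$dest")" ]; then
    copy_ok=yes
    # Only at this point would it be reasonable to remove the source:
    # rm -r "$src"
else
    copy_ok=no
fi

echo "copy verified: $copy_ok"
```

For very large data sets the checksum pass reads all the data again on both sides, so it may be worth running it in a serial batch job as well.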
rsync command
The rsync command creates a copy of a file, or if given the -r flag a directory, at the given destination, in the same way as cp above. The form shown here is the one used for a 'local' copy to a directly mounted file-system. The general form for local copies made with rsync is:
rsync [options] source destination
Again, for the transfer of a large amount of data, you may wish to use the serial nodes on ARCHER. In this case you should use a submission script, for example:
#!/bin/bash --login
#
#PBS -l select=serial=true:ncpus=1
#PBS -l walltime=00:20:00
#PBS -A budget

cd $PBS_O_WORKDIR

rsync [-r] source destination
In the above script 'source' should be the absolute path of the file/directory being copied or the script should be stored in and submitted from the directory containing the source file/directory.
Because rsync attempts to 'mirror' directories between source and destination, transferring directories containing large numbers of files will result in a large number of meta-data operations. This can significantly reduce the performance of data transfers. However, rsync can still be a good choice when re-synchronising a previously copied directory that contains very large files, as rsync will not move files that already exist (and have the correct size and date) at the destination. If your file sizes are fairly small (less than a GB) then the extra meta-data operations needed might be more expensive than the time saved, so a simple cp that overwrites all the data might be faster.
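The re-synchronisation behaviour can be observed with a small local test, sketched below. It uses -a rather than plain -r because -a preserves modification times, which rsync's default size-and-date check relies on. The --itemize-changes option prints one line per change, and lines beginning with ">f" mark files that were transferred. The directory and file names are hypothetical, and the demonstration is skipped if rsync is not installed.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch area with two small files.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
src="$workdir/src"
dst="$workdir/dst"
mkdir -p "$src"
echo "first" > "$src/a.dat"
echo "second" > "$src/b.dat"

rsync_demo="skipped"
if command -v rsync > /dev/null 2>&1; then
    # Initial copy, preserving times and permissions.
    rsync -a "$src/" "$dst/"

    # Second pass with nothing changed: count files that get re-sent.
    resent=$(rsync -a --itemize-changes "$src/" "$dst/" | grep -c '^>f' || true)

    # Change one file (its size differs now) and synchronise again.
    echo "third, longer" >> "$src/a.dat"
    resent_after_change=$(rsync -a --itemize-changes "$src/" "$dst/" | grep -c '^>f' || true)

    rsync_demo="ran"
    echo "re-sent when unchanged: $resent, after one edit: $resent_after_change"
fi
```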
Data Transfer via SSH
The easiest way of transferring data to or from ARCHER is to use one of the standard programs based on the SSH protocol such as scp, sftp or rsync. These all use the same underlying mechanism (ssh) as you normally use to log in to ARCHER. So, once the command has been executed via the command line, you will be prompted for your password for the specified account on the remote machine. To avoid having to type in your password multiple times you can set up an ssh-key as documented in the user-guide.
The ssh command encrypts all traffic it sends. This means that file-transfer using ssh consumes a relatively large amount of CPU time at both ends of the transfer. The login nodes for ARCHER and RDF have fairly fast processors that can sustain about 100 MB/s transfer but you may have to consider alternative file transfer mechanisms if you want to support very high data rates. The encryption algorithm used is negotiated between the ssh-client and the ssh-server. There are command line flags that allow you to specify a preference for which encryption algorithm should be used. You may be able to improve transfer speeds by requesting a different algorithm from the default. The arcfour algorithm is usually quite fast if both hosts support it.
A single ssh based transfer will usually not be able to saturate the available network bandwidth or the available disk bandwidth so you may see an overall improvement by running several data transfer operations in parallel. To reduce meta-data interactions it is a good idea to overlap transfers of files from different directories.
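A simple way to run several transfers at once is to hand a list of directories to xargs with its -P option, which limits how many commands run concurrently. The sketch below is only an illustration: it uses a local cp -r as a stand-in so that it can be run anywhere, and the directory names and the choice of two parallel jobs are assumptions. For a real transfer the cp -r command would be replaced by scp -r or rsync over ssh with a remote destination.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch area: three source directories, one destination.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
src="$workdir/src"
dest="$workdir/dest"
mkdir -p "$src/dirA" "$src/dirB" "$src/dirC" "$dest"
for d in dirA dirB dirC; do
    echo "data from $d" > "$src/$d/file.dat"
done

# Run at most two copies at a time, one per source directory.
# "cp -r" is a local stand-in for the real transfer command.
printf '%s\n' "$src"/dir* | xargs -P 2 -I{} cp -r {} "$dest/"

transferred=$(find "$dest" -type f | wc -l | tr -d ' ')
echo "files at destination: $transferred"
```

Giving each parallel job its own directory follows the advice above about overlapping transfers from different directories.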
SSH from ARCHER to ARCHER2
For details of ARCHER to ARCHER2 transfers, please see https://docs.archer2.ac.uk/archer-migration/data-migration/
scp command
The scp command creates a copy of a file, or if given the -r flag a directory, on a remote machine. The following is an example of the command used to transfer files to ARCHER:
scp [options] source [email protected]:[destination]
In the above example, [destination] is optional: when it is left out, scp simply copies the source into the user's home directory. The 'source' should be the absolute path of the file/directory being copied, or the command should be executed in the directory containing the source file/directory.
If you want to request a different encryption algorithm add the -c algorithm-name flag to the scp options.
If you need to run scp from within a batch job, see the special instructions on how to use ssh-keys from batch jobs.
rsync command
The rsync command can also transfer data between hosts using an ssh connection. It creates a copy of a file, or if given the -r flag a directory, at the given destination, similar to scp above. However, given the -a option, rsync can also make exact copies (including permissions); this is referred to as 'mirroring'. In this case the rsync command is executed with ssh to create the copy on a remote machine. To transfer files to ARCHER the command should have the form:
rsync [options] -e ssh source [email protected]:[destination]
In the above example, [destination] is optional: when it is left out, rsync simply copies the source into the user's home directory. The 'source' should be the absolute path of the file/directory being copied, or the command should be executed in the directory containing the source file/directory.
Additional flags can be specified for the underlying ssh command by using a quoted string as the argument of the -e flag. For example:
rsync [options] -e "ssh -c arcfour" source [email protected]:[destination]
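Before requesting a particular encryption algorithm it may be worth checking that the local ssh client supports it. OpenSSH clients provide the -Q cipher query for this, and some newer OpenSSH releases have dropped older ciphers such as arcfour. The sketch below is an illustration only: the cipher name is an arbitrary example rather than a recommendation, and the check says nothing about what the remote server will accept.

```shell
#!/usr/bin/env bash
set -eu

# Example cipher name (assumption for illustration); change as needed.
wanted="aes128-ctr"

cipher_status="ssh not installed"
if command -v ssh > /dev/null 2>&1; then
    # "ssh -Q cipher" lists the ciphers the local client supports.
    if ssh -Q cipher | grep -qx "$wanted"; then
        cipher_status="supported"
    else
        cipher_status="unsupported"
    fi
fi

echo "$wanted: $cipher_status by the local client"
```

If the cipher is supported at both ends it can then be passed with the -c flag, either directly to scp or inside the quoted -e string for rsync.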
Support
If you have any questions about copying your data to/from ARCHER or the RDF, please contact the ARCHER helpdesk via [email protected].