Long Term Storage Best Practices
The following practices can improve the performance and usability of long term storage at ARSC.
- Storage resources are finite. Be judicious about the data you store. Periodically review your data holdings and carefully consider what you need; remove what is no longer needed.
- Avoid large numbers of files. Having hundreds of thousands of files in your $ARCHIVE can drastically degrade the performance of the long term storage processes. When possible, save related files into a single tar file.
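As a sketch of this practice, related output files can be bundled into one tar file before copying to $ARCHIVE. The directory name run042 and the demo files below are made up so the snippet is self-contained:

```shell
# Bundle related files into a single tar file instead of archiving
# many small files individually. "run042" is a hypothetical directory
# of related outputs; the files are created here only for the demo.
mkdir -p run042
echo "step 1 output" > run042/out1.dat
echo "step 2 output" > run042/out2.dat

tar -cf run042.tar run042    # one tar file instead of many small files
tar -tf run042.tar           # list the archive contents to verify
```

A single run042.tar then costs the archive system one file operation instead of one per member.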
- Large file transfers to $ARCHIVE: When copying more than one terabyte of data into $ARCHIVE, limit the number of simultaneous streams (e.g., concurrent cp commands) to one. This allows the archiving daemon to keep up with the creation of tape copies while leaving tape drives available for other users.
Faster transfer speeds can be achieved by using rcp instead of cp. Historically, we have recommended using the bigdip-s host in rcp commands. This host is not available from Fish, so we now recommend using bigdipper instead. For example:
/usr/bin/rcp results.tar "bigdipper:$ARCHIVE"
- File transfers from $ARCHIVE: When transferring large numbers of files from $ARCHIVE to $CENTER or a remote location, staging the files with the batch_stage command prior to the transfer will drastically reduce the time required to complete the transfer.
Faster transfer speeds can be achieved by using rcp instead of cp. For example:
/usr/bin/rcp "bigdipper:$ARCHIVE/results.tar" $CENTER
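The stage-then-transfer pattern can be sketched as below. The file names are hypothetical, batch_stage is assumed to accept file paths as arguments (check its usage on bigdipper), and the command -v guard lets the snippet run on systems without it:

```shell
# List the archived files to transfer (hypothetical names), stage them
# all in one batch_stage call, then transfer them one at a time.
printf '%s\n' results1.tar results2.tar results3.tar > transfer.list

if command -v batch_stage >/dev/null 2>&1; then
    # Bring every file online in one pass before any rcp begins.
    xargs batch_stage < transfer.list
    while read -r f; do
        /usr/bin/rcp "bigdipper:$ARCHIVE/$f" "$CENTER"
    done < transfer.list
else
    echo "batch_stage not available here; showing the pattern only"
fi
```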
- Use the batch_stage command. When using batch_stage, run only one instance at a time. A limited number of tape drives is shared by all users; when demand for drives exceeds the number available, performance becomes drastically worse for everyone.
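One way to avoid accidentally starting a second instance is to check for a running one first. This guard is a local convention using pgrep, not an official ARSC feature:

```shell
# Check whether another batch_stage is already running before starting
# one; pgrep exits non-zero when it finds no matching process.
if pgrep -x batch_stage >/dev/null 2>&1; then
    echo "batch_stage already running; wait for it to finish"
else
    echo "no running batch_stage found; safe to start one"
    # batch_stage file1.tar file2.tar ...
fi
```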
- Use tar files. If you have many files which will be used together (such as source code), use a tar file to store all of the files together.
- About creating tar files: When creating a tar file from files stored in $ARCHIVE, it is most effective to do this directly on bigdipper. The batch_stage command should be used to bring all files online prior to issuing the tar command.
- Large files: While terabyte-sized files can be accommodated, it is generally best to keep file sizes under 250 gigabytes to reduce the chance of problems during transfer, archiving, and staging.
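The standard split utility can cut an oversized file into fixed-size pieces before archiving, and cat reassembles them later. This sketch scales the piece size down to a few bytes so it runs anywhere; in practice you would use something like -b 200G:

```shell
# Split a file into fixed-size pieces and verify they reassemble.
# The tiny file and -b 4 piece size are demo stand-ins for a real
# multi-hundred-gigabyte file and a -b 200G piece size.
printf 'abcdefghij' > bigfile.dat
split -b 4 bigfile.dat bigfile.part.   # bigfile.part.aa, .ab, .ac
cat bigfile.part.* > bigfile.rebuilt
cmp bigfile.dat bigfile.rebuilt && echo "pieces reassemble cleanly"
```

Shell glob expansion sorts the piece names, so cat on bigfile.part.* restores them in order.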
- Save to $ARCHIVE as you work. It is best to copy files to $ARCHIVE as they are created rather than saving the work up and copying it all at once. The archive system handles regular, smaller volumes of new data better than occasional floods. This practice also minimizes the impact of a catastrophic failure of the temporary filesystem(s) on the HPC system.
- Verify transfers are successful when transferring files between systems. It is best to confirm that the copy matches before removing the source file. Various utilities (sum -r, md5, etc.) can be used to check the integrity of the copy against the source. Run these utilities on the system where each file resides in order to avoid NFS-related issues (e.g. cache effects).
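The verify-before-delete pattern can be sketched as follows, using md5sum (the GNU counterpart of the md5 utility mentioned above). File names are made up for the demo; on real transfers, run each checksum locally on the system holding that copy:

```shell
# Copy a file, compare checksums of source and copy, and only then
# remove the source.
echo "important results" > source.dat
cp source.dat copy.dat

src_sum=$(md5sum source.dat | awk '{print $1}')
dst_sum=$(md5sum copy.dat   | awk '{print $1}')

if [ "$src_sum" = "$dst_sum" ]; then
    echo "checksums match; safe to remove source"
    rm source.dat
else
    echo "checksum mismatch; keep the source file" >&2
fi
```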
- Release files to tape when finished copying files to/from $ARCHIVE. Using the release command frees disk space, allowing batch_staged files to be brought online.
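A sketch of the release step, assuming the release command accepts file paths as arguments (check its usage on bigdipper). The file names are examples, and the guard lets the snippet run on systems without the command:

```shell
# Release the disk copies of files we are done with so that space can
# be used to bring batch_staged files online.
files="results1.tar results2.tar"

if command -v release >/dev/null 2>&1; then
    release $files
    echo "released: $files" > release.log
else
    echo "release not available here; pattern only" > release.log
fi
```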