Backing Up Your Data
Backups are important--anything that would cause a significant setback if lost should be backed up. Home directories, both for groups and for individual users, are backed up automatically by us, so data there can be recovered after any disaster short of the simultaneous destruction of multiple buildings on campus. Compute directories, however, are not backed up--they are meant to be used as scratch space, never holding more than a couple of weeks' worth of data at a time, which is standard practice for scratch storage. If you have more than a terabyte of data that needs to be backed up, you will have to set up backups of your compute directories yourself.
General principles
Keep in mind that setting up backups is hard work--expect to spend at least several hours readying a backup solution that fits your needs. If your data is important and can't fit in your home directory, though, that time is a small price to pay.
No backup system is truly fire-and-forget--authentication tokens expire, servers go down, software updates cause breaking changes, and so on. Make sure to regularly check that your backup system is working as intended. "My backup solution was working two years ago" is all too common a refrain when disaster strikes, so schedule a time once a month to confirm that you can actually restore your data. It can also be helpful to have your backup job notify you automatically when it fails, for example by sending an email or by creating a file named BACKUP_IS_FAILING_NEEDS_IMMEDIATE_ATTENTION on your desktop.
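One way to do this is to wrap your backup command in a small script that raises an alarm on failure. The following is a minimal sketch, assuming a hypothetical backup script at $HOME/bin/run-backup.sh and a hypothetical notification address; it also assumes the mail command is available on the machine running the backup.

```
#!/usr/bin/env bash
# Hypothetical wrapper around a backup command. If the backup exits with a
# non-zero status, leave a loud marker file and, if `mail` is available,
# send an email.

set -u

BACKUP_CMD="$HOME/bin/run-backup.sh"   # hypothetical: replace with your actual backup command
ALERT_FILE="$HOME/Desktop/BACKUP_IS_FAILING_NEEDS_IMMEDIATE_ATTENTION"
ALERT_EMAIL="you@example.edu"          # hypothetical address

if "$BACKUP_CMD"; then
    # Backup succeeded; clear any stale alert marker.
    rm -f "$ALERT_FILE"
else
    touch "$ALERT_FILE"
    if command -v mail >/dev/null 2>&1; then
        echo "Backup failed on $(hostname) at $(date)" | mail -s "Backup failure" "$ALERT_EMAIL"
    fi
fi
```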
Available backup solutions
The simplest backup is a copy--transferring files from the supercomputer to another machine or to the cloud by hand. Such a backup, without automation or snapshots, is not reliable, because it is very easy to neglect manual backups; we strongly recommend that you set up automatic backups instead. cron can be used to run backups automatically at daily, weekly, or monthly intervals.
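For example, a crontab entry like the following, edited in with crontab -e, runs a backup every night at 2:00 a.m. The script path and log file are placeholders; substitute your own backup command.

```
# m  h  dom mon dow   command
  0  2  *   *   *     $HOME/bin/backup.sh >> $HOME/backup.log 2>&1
```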
Slightly more sophisticated is a tool that mirrors what is being backed up but keeps old versions of files that have been changed or deleted. This has the advantage of being trivially recoverable without any special tools, but it is usually slower and more storage-intensive than a true backup solution. rsync can do this with the --backup and --backup-dir options; if you would like to back up to the cloud, rclone sync with the --backup-dir option accomplishes the same thing.
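The sketch below shows both variants. The hostname, remote name, and all paths are placeholders: backuphost stands for another machine you can reach over SSH, and mycloud for a cloud remote you have already configured with rclone config.

```
# Mirror a directory to another machine; files that are replaced or deleted
# are moved into a dated directory on the destination instead of being lost.
rsync -a --delete \
      --backup --backup-dir=/backups/project-old/$(date +%F) \
      /compute/mygroup/project/ backuphost:/backups/project-current/

# The equivalent with rclone, syncing to a cloud remote named "mycloud".
rclone sync /compute/mygroup/project mycloud:project-backup \
      --backup-dir mycloud:project-backup-old/$(date +%F)
```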
Unless you are backing up very little data or your data is nearly static, you should use a real backup tool. As of this writing, we recommend using Kopia for backups unless you're already familiar with another backup tool. If you know of a stable, easy-to-use backup solution that has significant advantages over Kopia, let us know and we'll include it in this article.
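As a rough sketch of what a Kopia workflow looks like, assuming a repository stored on a mounted filesystem path (the paths are placeholders, and Kopia also supports cloud storage backends such as S3):

```
# One-time setup: create a repository to hold your snapshots.
kopia repository create filesystem --path /backups/kopia-repo

# Take a snapshot of the directory you want to protect
# (run this regularly, e.g. from cron).
kopia snapshot create /compute/mygroup/project

# List snapshots and restore one when needed.
kopia snapshot list /compute/mygroup/project
kopia snapshot restore <snapshot-id> /tmp/restored-project
```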
Backup methods
There are three fundamental ways to back up data on the supercomputer:
- Push data from the supercomputer to the cloud
- Push data from the supercomputer to another machine
- Pull data from the supercomputer to another machine
Which you choose depends on the resources at your disposal and on your backup profile, but for most people we recommend backing up to the cloud: it requires little maintenance, and capacity is easy to expand when you hit your limit.
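The push cases look like the rsync, rclone, and Kopia examples above. For the pull case, a minimal sketch run on the other machine (not on the supercomputer) might look like the following; the hostname and paths are placeholders.

```
# Pull a copy of the project directory from the supercomputer over SSH.
rsync -a supercomputer.example.edu:/compute/mygroup/project/ /local/backups/project/
```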