Rclone

Rclone allows one to move files and directories to and from cloud storage via the command line. In combination with box.byu.edu, where BYU students and faculty get unlimited free storage, it can make storing and backing up archival data much easier. Rclone+Box will help users who routinely run up against storage space constraints and who wish to back up data that can only fit in compute. Those who wish to collaborate without making others get ORC accounts can upload to Box with Rclone, then share their data with collaborators (even if those collaborators don't have Box accounts).

This tutorial will show how to configure Rclone with Box, a few of the most useful commands, and a couple of worked examples. It is by no means comprehensive, so those wanting to learn more should reference the documentation, which is excellent.

Note that while the storage on Box is unlimited, expansive storage comes at a cost: Box is slow (especially with small files), so it can take a while to move big chunks of data. If you have many small files, we recommend backing up with Restic, which aggregates small files and will significantly speed up transfer. Additionally, files stored on Box cannot exceed 32 GB in size, although Rclone easily works around this with its chunker overlay.

To use Rclone on the supercomputer, you'll need to load the rclone environment module with module load rclone.

Configuration

Keep in mind that Rclone need only be configured once--as soon as you've finished the steps below, you should never need to do so again as long as you use it at least monthly. You'll need to download Rclone on your local machine, unless you would like to forward a port and configure on the remote node as if it were local (this method is less reliable).

rclone authorize box

To use Rclone with Box, one needs to get an authorization token, which acts somewhat like a password and allows Rclone to access your files. To get such an authorization token, you'll install Rclone on your machine, log in to Box in your default browser, then run rclone authorize box on your local machine. If you're running Linux or *BSD, your package manager probably has Rclone available; for Windows or MacOS (or Linux/*BSD if you prefer), you can download generic binaries directly from the website.

Assuming you're logged in to Box in your browser, running rclone authorize box should send you to a screen in your browser with a big blue Grant access to Box button--click it, and you should be greeted with a success message. Back in the terminal or command prompt, everything between ---> and <--- is the authorization token and associated information:

If your browser doesn't open automatically go to the following link: http://127.0.0.1:53682/auth
Log in and authorize rclone for access
Waiting for code...
Got code
Paste the following into your remote machine --->
{"access_token":"XXXXX","token_type":"bearer","refresh_token":"XXXXX","expiry":"2019-01-01T00:00:00-06:00"}
<---End paste

For those who aren't familiar with command-line applications, the following sets of commands will download Rclone (without need for administrative access) and run rclone authorize box when pasted into a terminal/command prompt:

Windows 10:

set arch=amd64
wmic os get OSArchitecture | findstr 32 && set arch=386
set rclone_zip=rclone-current-windows-%arch%.zip
set rclone_dir=%temp%\orc-rclone-box-auth-%time::=_%
mkdir %rclone_dir%
cd %rclone_dir%
curl https://downloads.rclone.org/%rclone_zip% -O
powershell.exe "Expand-Archive %rclone_zip% ."
cd rclone-*-windows-%arch%
rclone.exe authorize box

MacOS/Linux/*BSD:

$SHELL
set -e
os="$(uname | tr '[:upper:]' '[:lower:]')"
[[ "$os" == darwin ]] && os=osx
case "$(uname -m)" in
    x86_64|amd64) arch=amd64;;
    i*86) arch=386;;
    arm) arch=arm;;
    aarch64*|armv8*) arch=arm64;;
    mips*) arch=mips$(lscpu | awk '$0~"Little Endian" {print "le"}');;
    *) echo "ERROR: architecture not supported"; exit 1;;
esac
rclone_dir="$(mktemp -d)"
rclone_zip="rclone-current-$os-$arch.zip"
cd "$rclone_dir"
curl "https://downloads.rclone.org/$rclone_zip" -O \
    || { echo "ERROR: precompiled binary not available"; exit 2; }
unzip "$rclone_zip"
cd rclone-*-"$os-$arch"
chmod +x rclone
./rclone authorize box
echo "Successfully authorized Box" && exit

Note that on CPU architectures outside of x86-64 and i386 and operating systems outside of MacOS, Linux, and the most common BSDs, these commands may fail since precompiled binaries are only available for some combinations of architecture and operating system. In that case, you'll probably need to install from source (which you're probably used to anyway).

Once the last command has finished, feel free to delete rclone_dir--you're done using Rclone on your local machine.

After you've run rclone authorize box, you can move on to configuration; the easiest way is to use our default rclone.conf, but you can also run rclone config manually if you want more customization.

The Easy Way

Rclone stores its data in a configuration file located at ~/.config/rclone/rclone.conf. Since most people will use roughly the same configuration to access files on Box, we provide a default rclone.conf here:

[box]
type = box
token = PASTE_TOKEN_HERE

[boxChunk]
type = chunker
remote = box:orc-chunker-remote
chunk_size = 30G
hash_type = sha1

This describes 2 "remotes": your Box storage (the [box] section), and a chunker remote that overlays your Box storage (the [boxChunk] section). This chunker remote allows files greater than 32GB to be stored on Box by transparently splitting such files (see the Chunker section below).

Now that you have your authorization token (see rclone authorize box above), simply copy the above default file to ~/.config/rclone/rclone.conf on the supercomputer (you'll probably need to create the ~/.config/rclone directory first) and replace PASTE_TOKEN_HERE with the authorization token, which will look something like {"access_token":"XXXXX","token_type":"bearer","refresh_token":"XXXXX","expiry":"2019-01-01T00:00:00-06:00"} (make sure to include the curly braces). To use the chunker remote you'll need to create the orc-chunker-remote directory with rclone mkdir box:orc-chunker-remote; you can of course use a different directory name as long as you edit rclone.conf accordingly.
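
The steps above could be scripted roughly as follows. This is only a sketch: write_box_conf is a hypothetical helper name (not part of Rclone), and it simply writes out the default configuration shown above. It refuses to overwrite an existing rclone.conf and keeps the file private, since the token acts somewhat like a password.

```shell
# Hypothetical helper (not part of Rclone): write the default config shown
# above into directory $1, substituting the authorization token $2.
write_box_conf() {
    conf_dir="$1"
    token="$2"
    conf_file="$conf_dir/rclone.conf"
    if [ -e "$conf_file" ]; then
        echo "refusing to overwrite $conf_file" >&2
        return 1
    fi
    mkdir -p "$conf_dir" || return 1
    # The token works somewhat like a password, so create the file with
    # owner-only permissions (the umask change is confined to the subshell).
    ( umask 077
      cat > "$conf_file" <<EOF
[box]
type = box
token = $token

[boxChunk]
type = chunker
remote = box:orc-chunker-remote
chunk_size = 30G
hash_type = sha1
EOF
    )
}
```

On the supercomputer this might be called as write_box_conf ~/.config/rclone '{"access_token":"XXXXX",...}' (single quotes keep the shell from touching the token), followed by rclone mkdir box:orc-chunker-remote.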

rclone config

This is an alternative to using our sample rclone.conf; it will allow you to choose different names and options than those we provide as default.

To access Rclone, log in to the supercomputer and load the rclone module:

module load rclone

Once that's done, run rclone config. This will give you a few options:

No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> n

Enter n to make a new remote. Give it a name (e.g. box), then choose which storage service you'd like to configure (you can type box for box.byu.edu, drive for Google Drive, etc.).

It'll ask for Box App Client Id and Box App Client Secret; most users should simply hit enter to leave these blank. You'll then be asked if you want to "Edit advanced config" (most users should enter n):

Edit advanced config? (y/n)
y) Yes
n) No
y/n> n

Next, you will be asked whether to use auto config:

Remote config
Use auto config?
 * Say Y if not sure
 * Say N if you are working on a remote or headless machine
y) Yes
n) No
y/n> n

Since you are working on a remote machine, enter n. You will then be presented with a message prompting you to run rclone authorize "box" on your local machine:

For this to work, you will need rclone available on a machine that has a web browser available.
Execute the following on your machine:
    rclone authorize "box"
Then paste the result below:
result>

Run rclone authorize "box" in a command prompt on your local machine (see rclone authorize box above) and paste the authorization information (of the form {"access_token":"XXXXX","token_type":"bearer","refresh_token":"XXXXX","expiry":"2019-01-01T00:00:00-06:00"}) after result> on the remote terminal:

result> {"access_token":"XXXXX","token_type":"bearer","refresh_token":"XXXXX","expiry":"2019-01-01T00:00:00-06:00"}
--------------------
[box]
type = box
client_id = 
client_secret = 
token = {"access_token":"XXXXX","token_type":"bearer","refresh_token":"XXXXX","expiry":"2019-01-01T00:00:00-06:00"}
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y

After entering y, you're finished configuring Rclone to work with Box. You'll almost surely want to set up a chunker overlay (see Chunker below) and, if you're working with sensitive data, you may want to set up an encrypted overlay on top of that (see Crypt below).
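
Whichever way you configured Rclone, a quick check that the remote exists and responds can save confusion later. The sketch below relies on rclone listremotes and rclone lsd; check_remote is a hypothetical helper name, and the remote name you pass in (box here) is whatever you chose during configuration.

```shell
# Hypothetical helper: succeed only if remote $1 is configured and reachable.
check_remote() {
    name="$1"
    # `rclone listremotes` prints one configured remote per line, e.g. "box:"
    if ! rclone listremotes | grep -qxF "$name:"; then
        echo "remote '$name' is not in rclone.conf" >&2
        return 1
    fi
    # Listing the top-level directories exercises the stored token.
    if ! rclone lsd "$name:" > /dev/null; then
        echo "remote '$name' is configured but could not be listed" >&2
        return 2
    fi
    echo "remote '$name' looks usable"
}
```

After module load rclone, running check_remote box should report that the remote looks usable; a failure at the listing step usually points at the token rather than at the configuration file.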

Special Remotes

Chunker

The chunker overlay automatically splits files larger than a certain threshold on upload, which allows us to use Box to store files larger than 32GB. When downloading, the split files are automatically rejoined. We strongly recommend using chunker--it has no real downsides and preempts the frustration of cryptic upload errors when one tries to move a big file up to Box.

Big files are split into pieces transparently--if you download them outside of the chunker overlay and thus don't get the benefit of automatic concatenation, you can just use cat to put them together. With chunker's default settings, split files are named filename.rclone_chunk.001, filename.rclone_chunk.002, etc., so a simple cat filename.rclone_chunk.* > filename is all that's needed to combine them. Note that you don't need to do this if you access the files via your chunker remote.

You can copy the [boxChunk] section of the example config file in The Easy Way above into your ~/.config/rclone/rclone.conf and create the corresponding directory in Box to create a chunker remote.

Crypt

The crypt overlay encrypts files in a given remote directory such that they cannot (assuming a good password) be read even if someone obtains the files; a downside of this is that you can't simply download the files from box.byu.edu and use them directly. As such, we don't recommend the crypt overlay for most users. If you do decide to use it, make sure to check whether such a setup satisfies any regulations your data is governed by.

To set up a crypt remote, create a remote directory for it (e.g. boxChunk:crypt-remote), run rclone config, choose a name (e.g. boxCrypt) and the crypt storage type, then follow the prompts to finish configuring. Encrypting file names has slight usability disadvantages, but the "standard" file name encryption option is the most secure. You'll choose a password and a salt (please don't neglect the salt), and your crypt remote is configured.

Usage

This tutorial will only cover the basics due to the clarity and breadth of Rclone's exceptional documentation, which should be your first resource when learning its usage. Type rclone --help to see a good synopsis of each command. For help on a specific command, you can also use rclone <command> --help (e.g. rclone copy --help).

Listing files

Rclone gives a few methods for listing files; none of them are quite like Unix's ls, but rclone lsf --max-depth 1 remote:path/to/dir comes close. A few more examples:

# Recursively list all files at "boxChunk" remote
rclone ls boxChunk:

# Show directories in "mydir" at "boxChunk"
rclone lsd boxChunk:mydir

# Recursively list files in "mydir/dir1" at "boxChunk" with more detail
rclone lsl boxChunk:mydir/dir1

Moving and Copying

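rclone copy and rclone move behave essentially like Unix's cp and mv, respectively; you can copy and move to or from the remote. One difference: the destination is always treated as a directory, so use rclone copyto or rclone moveto if you need to name the destination file. Example usage: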

# Copy remote file, mydata.txt, from "mydir" at "boxChunk"
rclone copy boxChunk:mydir/mydata.txt $HOME/data/

# Move a tarball from compute to "mydir/compute-backup" at "boxChunk"
rclone move ~/compute/my-tarball.tar.gz boxChunk:mydir/compute-backup

Creating Directories

rclone mkdir behaves like Unix's mkdir; to create a new directory on a remote, you would use something like:

rclone mkdir boxChunk:mydir/myNewDirectory

Examples

Move Archival Data to Box

Say you have a directory with data that needs to be kept, but you don't expect to do any work on it with the supercomputer, and you're running out of space. You can either move it directly, or compress it and move it. Moving it directly is easier and you'll be able to look at the data directly at box.byu.edu, but compressing then moving could be faster.

Generally, if you have a few big files you won't be slowed down too much by copying directly, but if you have many small files it will take a long time. Under ideal conditions, you can copy 4 files per second (across all processes--Box limits transfers by user). If you have a million files, that means it will take at least a few days to transfer them, no matter how small they each are.
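
To make that estimate concrete for your own data, the arithmetic can be sketched in shell. The 4 files per second figure is the ideal-case rate quoted above, so treat the result as a lower bound; estimate_box_seconds is a hypothetical helper name.

```shell
# Hypothetical helper: lower-bound transfer time in seconds for $1 files,
# assuming the ideal-case rate of 4 files per second quoted above.
estimate_box_seconds() {
    files_per_second=4
    # Integer division, rounded up so that a single file counts as 1 second.
    echo $(( ($1 + files_per_second - 1) / files_per_second ))
}

# One million files: 250000 seconds, i.e. just under 3 days.
secs=$(estimate_box_seconds 1000000)
echo "$secs seconds (~$(( secs / 3600 )) hours)"
```

For a real directory, something like estimate_box_seconds "$(find ~/compute/dataset -type f | wc -l)" gives a rough floor; if the answer is measured in days, compressing first (or Restic) is probably worth it.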

To move without compressing, simply use:

rclone move ~/compute/dataset boxChunk:mydir/dataset

There are two main ways to compress then move data. This one is slower and more reliable:

tar -czf dataset.tar.gz ~/compute/dataset
# rclone treats the destination as a directory, so name only the directory
rclone move dataset.tar.gz boxChunk:mydir/

This one is faster and doesn't use significant disk space, but the work will be lost if the command is interrupted:

tar -czf - ~/compute/dataset | rclone rcat boxChunk:mydir/dataset.tar.gz
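
One detail worth considering for the tarball-on-disk approach: given an absolute path such as ~/compute/dataset, tar records the whole directory chain (minus the leading slash) inside the archive. The sketch below uses tar's -C option so that the archive contains just dataset/..., and lists the archive before anything is moved. make_tarball is a hypothetical helper name, and the upload itself is left to the rclone commands above.

```shell
# Hypothetical helper: archive directory $1 into tarball $2 with paths
# relative to the directory's parent, then confirm the archive is readable.
make_tarball() {
    src="$1"
    out="$2"
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Listing reads the archive end to end, so a truncated file fails here.
    tar -tzf "$out" > /dev/null
}
```

A possible use is make_tarball ~/compute/dataset dataset.tar.gz && rclone move dataset.tar.gz boxChunk:mydir/, so that the move only happens once the tarball has been written and read back successfully.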

Back up compute with Box

Perhaps you have a large set of data in ~/compute/dataset, which is too big to fit in your home directory, that you would like to back up weekly. Say you set up the following directory structure to store the backups:

boxChunk:
'-- backup
    '-- dataset
        '-- old
        '-- primary

...by running:

rclone mkdir boxChunk:backup
rclone mkdir boxChunk:backup/dataset
rclone mkdir boxChunk:backup/dataset/old
# primary will be created by the copy

The current backup will live at boxChunk:backup/dataset/primary, while older snapshots, organized by date, will go in boxChunk:backup/dataset/old/. To get started, let's copy over dataset to the current backup directory at boxChunk:backup/dataset/primary:

rclone copy ~/compute/dataset boxChunk:backup/dataset/primary

Keep in mind that Box is slow, so this may take some time. If you want to exit your ssh session while the copy is going, you may want to use screen or tmux to make the transfer.

Once the copy is done, you'll need to back up every week (or however frequently you would like to). This could go something like:

module load rclone
PRIMARY=boxChunk:backup/dataset/primary
# --backup-dir must be on the same remote as the destination
OLD=boxChunk:backup/dataset/old/dataset-$(date +%F_%H-%M)
screen -dm rclone sync "$HOME/compute/dataset" "$PRIMARY" --backup-dir "$OLD"
# using `screen -dm ...` means that rclone will keep going even if you log out

If you want to do this regularly, you can put it in a script and run it at your convenience; you can use cron to run it automatically at regular intervals. To make the script (we'll call it do_rclone_backup.sh) execute weekly, use crontab -e to edit your crontab and enter something along the lines of 0 X * * Y bash /path/to/do_rclone_backup.sh (replacing X with an hour, 0-23, and Y with a day of the week, 0-6, where 0 is Sunday). Your backup script will now run once a week with no intervention from you. This tutorial goes into more depth in case you want to back up more or less frequently or would like to learn more about cron generally.
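
Putting the pieces together, do_rclone_backup.sh might look something like the sketch below, which writes the script to the current directory. The paths and remote names mirror the example above; the guard around module load and the sample schedule are assumptions to adapt. Note that --backup-dir is given with the boxChunk: prefix, since Rclone expects the backup directory to be on the same remote as the destination. No screen is needed here because cron jobs are not tied to a login session.

```shell
# Write the (hypothetical) backup script to the current directory.
cat > do_rclone_backup.sh <<'EOF'
#!/bin/bash
# Weekly backup of ~/compute/dataset to Box; a sketch, adjust names to taste.

# cron starts with a minimal environment, so `module` may not be defined;
# load the rclone module only when the command exists (assumption).
if type module > /dev/null 2>&1; then
    module load rclone
fi
command -v rclone > /dev/null || { echo "rclone not found" >&2; exit 1; }
set -e

SRC="$HOME/compute/dataset"
PRIMARY="boxChunk:backup/dataset/primary"
# --backup-dir must be on the same remote as the destination.
OLD="boxChunk:backup/dataset/old/dataset-$(date +%F_%H-%M)"

# Files changed or deleted since the last run are moved to $OLD instead of
# being overwritten or removed.
rclone sync "$SRC" "$PRIMARY" --backup-dir "$OLD"
EOF
chmod +x do_rclone_backup.sh

# Example crontab entry (added via `crontab -e`): 02:00 every Sunday.
# 0 2 * * 0 bash /path/to/do_rclone_backup.sh
```

Running bash do_rclone_backup.sh once by hand before adding the crontab entry is a cheap way to confirm that the module loads and the remote is reachable from a non-interactive shell.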