Parallel BZIP2 Compression

This page documents the installation of the mpibzip2 tool, which can speed up the compression of large files. However, as the author points out, "This is a BETA version - Use at your own risk!" We cannot take responsibility for data corruption that results from using this or any other tool. We don't anticipate any problems with mpibzip2, but you need to be aware of the possibility.

Introduction

The mpibzip2 tool is a parallel implementation of the bzip2 compression algorithm, written by Jeff Gilchrist. It uses multiple processors to compress (and sometimes decompress) large files. On our system it is compiled against OpenMPI, which is the default MPI implementation here.

How it works

The bzip2 compressed file format allows multiple compressed pieces to be concatenated together. The mpibzip2 tool takes advantage of this when compressing by splitting the large file into smaller pieces. The pieces are sent via MPI to different processors, where each one is compressed using the standard bzip2 libraries on the system. Each compressed piece is then sent back and added to the final compressed file.

Because the file is split into pieces, and each piece is compressed separately, the resulting file may be larger than it would be if you compressed it using the standard bzip2 tools, since mpibzip2 cannot find patterns in common between two or more pieces.
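The concatenation property described above can be checked with the standard serial tools alone. The following is a minimal sketch, not how mpibzip2 itself is implemented: the file names and the 256 KB piece size are arbitrary choices for illustration, and split merely stands in for the way the input is divided among processors.

```shell
# Illustration only: imitate "split, compress each piece, concatenate"
# using the serial bzip2, split, and cat tools. No MPI is involved.
workdir=$(mktemp -d)

# Create a sample input file (about 1.3 MB of text).
seq 1 200000 > "$workdir/sample.txt"

# Split it into 256 KB pieces named piece.aa, piece.ab, and so on.
split -b 256k "$workdir/sample.txt" "$workdir/piece."

# Compress every piece independently; each one becomes piece.XX.bz2.
for piece in "$workdir"/piece.*; do
    bzip2 "$piece"
done

# Concatenate the compressed pieces, in order, into a single .bz2 file.
cat "$workdir"/piece.*.bz2 > "$workdir/joined.bz2"

# The standard bzip2 reads the joined file as one compressed file.
if bzip2 -dc "$workdir/joined.bz2" | cmp -s - "$workdir/sample.txt"; then
    echo "round trip OK"
fi
```

Comparing the size of joined.bz2 with the result of running bzip2 on the whole of sample.txt gives a feel for the size penalty mentioned above, although the difference will vary with the data.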

When decompressing, the reverse process occurs. The compressed pieces are extracted from the file, and sent to various processors for decompression. Then the resulting decompressed pieces are sent back to be assembled into the decompressed file.

Important note: Files compressed using mpibzip2 can be decompressed using the standard bzip2 tools, and vice versa. However, if your compressed .bz2 file does not contain multiple pieces, there is no speed advantage to decompressing in parallel. Since the standard tools generally create only one piece in the compressed file, most files will decompress no faster with mpibzip2. If you compressed your file using mpibzip2, however, you will probably see an advantage to using mpibzip2 for the decompression as well.
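One rough way to judge whether a given .bz2 file contains multiple pieces is to count the bzip2 stream headers in it. The sketch below is a heuristic, not a feature of mpibzip2 or bzip2. It assumes that each piece was written as a complete bzip2 stream, and it simply searches for the byte signature that begins a non-empty stream (the characters BZh, a block-size digit from 1 to 9, then 1AY&SY). The function name count_bz2_streams is made up for this example.

```shell
# Hypothetical helper: print the number of bzip2 stream signatures in a file.
count_bz2_streams() {
    # LC_ALL=C makes grep treat the file as plain bytes, -a forces text mode
    # on binary input, and -o prints every match on its own line for wc.
    LC_ALL=C grep -a -o 'BZh[1-9]1AY&SY' "$1" | wc -l
}
```

Running count_bz2_streams myreallybigfile.bz2 and getting 1 suggests the file holds a single piece, so parallel decompression is unlikely to help. A larger number suggests multiple pieces. The signature could in principle also occur by chance inside compressed data, so treat the count as an estimate.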

Installation location

The installation on the BYU supercomputers is located in the following directory:

/fslapps/mpibzip2/current/

The program is located at this path:

/fslapps/mpibzip2/current/bin/mpibzip2

The manual page, which describes the syntax, can be viewed using this command:

man /fslapps/mpibzip2/current/man/man1/mpibzip2.1
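If you would rather not type the full paths, one option is to add the installation to your search paths for the current session. This is a convenience sketch that assumes a Bourne-style shell such as bash. The examples elsewhere on this page keep using the full path, which works whether or not these variables are set.

```shell
# Assumption: bash or another Bourne-style shell. These settings last only
# for the current session unless you add them to a shell startup file.
export PATH="/fslapps/mpibzip2/current/bin:$PATH"

# Put the mpibzip2 manual directory ahead of any existing manual search path.
export MANPATH="/fslapps/mpibzip2/current/man:${MANPATH:-}"
```

With these set, man mpibzip2 should find the manual page without the full path.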

How to use it

Since mpibzip2 is an MPI application, for most purposes it should be run inside a job, using the standard mpirun or mpiexec launcher. For example, if you have a file named myreallybigfile, you could compress it inside a job using syntax like this:

mpirun /fslapps/mpibzip2/current/bin/mpibzip2 myreallybigfile

It's important to know that options for mpirun or mpiexec and options for mpibzip2 go in different places in the command: launcher options come before the path to mpibzip2, and mpibzip2 options come after it. For example, to run the same command as shown above, but pass the -n 2 option to mpirun (run on only 2 processors) and the -k option to mpibzip2 (don't delete the original file when done), you'd do it like this:

mpirun -n 2 /fslapps/mpibzip2/current/bin/mpibzip2 -k myreallybigfile
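Given the beta status of the tool, a cautious workflow is to keep the original with -k, check the compressed file with the standard bzip2, and delete the original only if the check passes. The helper below is a sketch of that idea. The name verify_bz2 is made up for this example, and the commented usage lines assume that the compressed output is named myreallybigfile.bz2.

```shell
# Hypothetical helper: succeeds only if the compressed file passes bzip2's
# integrity test and decompresses to exactly the original contents.
verify_bz2() {
    original=$1
    compressed=$2
    # bzip2 -t tests the compressed file without writing any output.
    bzip2 -t "$compressed" || return 1
    # Decompress to standard output and compare byte for byte.
    bzip2 -dc "$compressed" | cmp -s - "$original"
}

# Possible use inside a job, after compressing with -k (assumed file names):
#   mpirun /fslapps/mpibzip2/current/bin/mpibzip2 -k myreallybigfile
#   verify_bz2 myreallybigfile myreallybigfile.bz2 && rm myreallybigfile
```

The check runs serially and reads the whole file, so it adds time. Whether that is worthwhile depends on how hard the data would be to regenerate.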

Using with tar

Frequently, people wish to aggregate large numbers of files together into a single compressed archive using the tar command, as described on this page. Unfortunately, while you can use mpibzip2 to compress a tar file, it has to be done in multiple steps. For example, if you had a directory named mybigdirectory that contained several files, and you wanted to use mpibzip2 to create a compressed tar file, you'd have to do it like this:

tar cvf mytarfile.tar mybigdirectory
mpirun /fslapps/mpibzip2/current/bin/mpibzip2 mytarfile.tar

Important note: The tar portion of this process does not run in parallel, so if you have huge numbers of files to put together, this may not be the most efficient approach. At this time, we do not have any standard utilities to run tar in parallel. If this is a significant concern, please contact us, and we'll see what we can figure out.
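Because files produced by mpibzip2 can be read by the standard bzip2 tools, the compressed archive can later be listed or unpacked without MPI. The sketch below walks through the whole cycle on a tiny made-up directory. It assumes the compressed archive ends up named mytarfile.tar.bz2 and that your tar supports the j option (GNU tar does). Serial bzip2 stands in for the mpibzip2 step so that the example runs anywhere.

```shell
# Build a small sample directory to archive.
workdir=$(mktemp -d)
mkdir "$workdir/mybigdirectory"
echo "hello" > "$workdir/mybigdirectory/a.txt"
echo "world" > "$workdir/mybigdirectory/b.txt"

# Step 1: create the uncompressed archive (this part is serial).
tar cf "$workdir/mytarfile.tar" -C "$workdir" mybigdirectory

# Step 2: compress it. On the cluster this is where the mpirun command for
# mpibzip2 would go; the serial bzip2 is used here as a stand-in.
bzip2 "$workdir/mytarfile.tar"

# List the contents of the compressed archive with the standard tools.
tar tjf "$workdir/mytarfile.tar.bz2"

# Unpack into a separate directory by piping bzip2 output into tar.
mkdir "$workdir/restored"
bzip2 -dc "$workdir/mytarfile.tar.bz2" | tar xf - -C "$workdir/restored"
```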

Performance considerations

The exact performance of a tool like mpibzip2 will depend on the operating conditions and on the data being compressed. However, tests seem to indicate that its speedup is strongly sublinear: adding processors shortens the run time by much less than the proportional amount. For more about what this means, see this page.

In general, if you need to compress one large file, or a small number of large files, especially at the end of a processing job, then go ahead and use mpibzip2. However, if you have huge numbers of files to compress separately, it would actually be more efficient (less CPU time) to use the standard serial tools, run separately rather than as part of the same parallel job. For more information, feel free to contact us.
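For the many-files case, something as simple as the following may be enough when run on a single processor. It is a sketch using only the standard find and bzip2 tools. The function name compress_tree and the directory name in the usage note are made up for this example.

```shell
# Hypothetical helper: compress every regular file under a directory with the
# serial bzip2, skipping files whose names already end in .bz2.
compress_tree() {
    # "-exec ... {} +" hands many file names to each bzip2 invocation, which
    # then compresses them one after another.
    find "$1" -type f ! -name '*.bz2' -exec bzip2 {} +
}
```

For example, compress_tree myresultsdirectory replaces each file under that directory with a .bz2 version, just as running bzip2 on each file by hand would.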

Additionally, during our testing, we did not see any measurable impact from using InfiniBand-enabled nodes rather than Ethernet-only nodes. The differences we did see appeared to be related to the differing clock speeds of the corresponding processors.

Last changed on Tue Jul 3 09:13:44 2012