Unix file compression basics

Dynapac photo by Jan Mehlich from Wikimedia Commons

A couple of years ago I wrote a post about rzip, and I'm still getting emails and comments from people about it. I've decided to dedicate this post to answering some of these questions so I can point people to it.

For people coming from a Windows or classic Mac background, file compression on Unix-like operating systems (such as GNU/Linux and BSD) can seem a bit confusing. Unlike the ZIP format, which can hold multiple files, most Unix compressors only work on one file at a time. So we first use an archiver to bundle up the files we want to compress, then feed that archive to the compressor.

Step one: archive files

Overwhelmingly the most common file archiving tool on Unix-like systems is the tape archiver (tar). This command creates a new tar archive from a folder which may contain files and other folders:

% tar cvf NewArchive.tar ./FolderToArchive/
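If you want to try this without touching real files, here's a minimal sketch that builds a throwaway folder, archives it, and then lists what ended up inside with the "t" flag. The sample file names and the scratch directory from mktemp are made up for illustration, and I've left out "v" to keep the output short.

```shell
#!/bin/sh
# Sketch: archive a throwaway folder with tar, then list the archive.
# The folder contents below are invented for this example.
set -eu

work=$(mktemp -d)   # scratch directory so nothing real is touched
cd "$work"

mkdir -p FolderToArchive/Nested
printf 'hello\n' > FolderToArchive/a.txt
printf 'world\n' > FolderToArchive/Nested/b.txt

# c = create an archive, f = write it to the named file
tar cf NewArchive.tar ./FolderToArchive/

# t = list the table of contents without extracting anything
listing=$(tar tf NewArchive.tar)
printf '%s\n' "$listing"
```

Listing an archive before you send it anywhere is a cheap way to confirm it holds what you expect.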

An alternative is pax, which I prefer because it tends to be more consistent, it archives symbolic links (aka Unix shortcuts or aliases) instead of following them, and it has a few nifty features like being able to specify files that fall within a certain date range.

% pax -wf NewArchive.pax ./FolderToArchive/
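For completeness, pax can also list and extract what it wrote. This is only a sketch: pax isn't installed by default on every system, so the script checks for it first, and the sample folder, the Restored directory and the pax_ok variable are all names I made up for the example.

```shell
#!/bin/sh
# Sketch: write, list and extract a pax archive, if pax is available.
set -eu

work=$(mktemp -d)
cd "$work"
mkdir FolderToArchive
printf 'hello\n' > FolderToArchive/a.txt

pax_ok=no
if command -v pax >/dev/null 2>&1; then
    # -w = write an archive, -f = use the named archive file
    pax -wf NewArchive.pax ./FolderToArchive/

    # With neither -r nor -w, pax just lists the archive members
    pax -f NewArchive.pax

    # -r = read (extract); do it in a separate directory to avoid overwriting
    mkdir Restored
    (cd Restored && pax -rf ../NewArchive.pax)
    pax_ok=yes
else
    echo "pax is not installed here" >&2
fi
```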

Step two: compress the archive

The amount of time and CPU power you have will determine which compression algorithm you'll employ. On Unix systems the two most common are gzip, which is fast, and bzip2, which is slower but generally gets better compression ratios. Either of these commands compresses the archive we made in step one. Each replaces NewArchive.tar with a compressed NewArchive.tar.gz or NewArchive.tar.bz2, so run one or the other:

% gzip -v NewArchive.tar
% bzip2 -v NewArchive.tar

With both gzip and bzip2 you can adjust how much compression they perform by specifying a numeric flag from -1 (fastest) to -9 (best compression). Specify --help for more information.
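To see what the numeric flags do in practice, here's a rough sketch that compresses the same file at both extremes and prints the sizes. The sample file is generated on the spot and is entirely made up, so the numbers you get will differ from what real data gives you. I've used gzip only to keep it short; bzip2 accepts the same -1 to -9 and -c flags.

```shell
#!/bin/sh
# Sketch: compare gzip -1 and gzip -9 on a generated sample file.
set -eu

work=$(mktemp -d)
cd "$work"

# Build a repetitive, compressible text file (invented for this example)
i=0
while [ "$i" -lt 2000 ]; do
    echo "line $i: the quick brown fox jumps over the lazy dog"
    i=$((i + 1))
done > Sample.txt

# -c writes to standard output, so Sample.txt itself is left in place
gzip -1 -c Sample.txt > Sample.fast.gz
gzip -9 -c Sample.txt > Sample.best.gz

# wc -c counts bytes; tr strips the padding some systems add
orig=$(wc -c < Sample.txt | tr -d ' ')
fast=$(wc -c < Sample.fast.gz | tr -d ' ')
best=$(wc -c < Sample.best.gz | tr -d ' ')
echo "original: $orig bytes, -1: $fast bytes, -9: $best bytes"
```

On small files the difference between levels is often tiny, so it's worth measuring on your own data before paying for the slower setting.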

Step three: just do one step

Now that we know the difference between archiving and compressing, we can save ourselves some time and do them both in one step. Most versions of tar support an extra flag that tells it to compress an archive with the tool of your choice after it's been made. "z" specifies gzip and "j" specifies bzip2:

% tar czvf NewArchive.tar.gz ./FolderToArchive/
% tar cjvf NewArchive.tar.bz2 ./FolderToArchive/
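If it helps to see what the "z" flag is saving you from, here's a sketch that does the two steps by hand with a pipe and then does the same thing in one step. The sample folder and the Piped/OneStep file names are made up for the example. Giving tar "-" as the file name makes it write the archive to standard output.

```shell
#!/bin/sh
# Sketch: archive-then-compress by hand versus tar's built-in z flag.
set -eu

work=$(mktemp -d)
cd "$work"
mkdir FolderToArchive
printf 'hello\n' > FolderToArchive/a.txt

# Two steps joined by a pipe: tar writes to stdout, gzip compresses it
tar cf - ./FolderToArchive/ | gzip -c > Piped.tar.gz

# One step: tar runs the compression itself
tar czf OneStep.tar.gz ./FolderToArchive/

# Both archives should list the same members
piped=$(tar tzf Piped.tar.gz)
onestep=$(tar tzf OneStep.tar.gz)
printf '%s\n' "$piped"
```

The pipe form is still handy when your tar lacks a flag for the compressor you want to use.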

Step four: unarchiving

This is basically the reverse of what we did before: swap the "c" (create) flag for "x" (extract), and again tar can take care of the decompression for us:

% tar xzvf NewArchive.tar.gz
% tar xjvf NewArchive.tar.bz2
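Before unpacking an archive from somewhere else, I like to peek inside it and extract into a separate folder. Here's a minimal sketch of that, assuming your tar supports the -C option for choosing the target directory (GNU tar and bsdtar do). The sample folder and the Restored directory are made up for the example.

```shell
#!/bin/sh
# Sketch: list a compressed archive, then extract it into its own folder.
set -eu

work=$(mktemp -d)
cd "$work"
mkdir FolderToArchive
printf 'hello\n' > FolderToArchive/a.txt
tar czf NewArchive.tar.gz ./FolderToArchive/

# t = list the contents; nothing is written to disk
tar tzf NewArchive.tar.gz

# x = extract; -C changes into the given directory first
mkdir Restored
tar xzf NewArchive.tar.gz -C Restored

# cmp is silent and exits 0 when the two files are identical
cmp FolderToArchive/a.txt Restored/FolderToArchive/a.txt
```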

Recent versions of GNU tar (gnutar) take this shorthand one step further with the "a" flag, which works out what compressor to use from the file name you give it when creating an archive; for example, a name ending in .tar.bz2 tells gnutar to use bzip2. Unless you're running a very recent Linux distribution or have installed it specifically, though, you probably don't have it.
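Here's a small sketch for checking whether your tar understands the "a" flag. It tries to create a .tar.gz with it and then uses gzip -t to confirm the result really is gzip data. The sample folder and the auto variable are made up for the example, and I've used a .gz name rather than .bz2 only so the check needs nothing beyond gzip.

```shell
#!/bin/sh
# Sketch: test whether this tar supports choosing a compressor by file name.
set -eu

work=$(mktemp -d)
cd "$work"
mkdir FolderToArchive
printf 'hello\n' > FolderToArchive/a.txt

auto=no
# a = pick the compressor from the archive's file name suffix
if tar caf NewArchive.tar.gz ./FolderToArchive/ 2>/dev/null; then
    # gzip -t succeeds only if the file is valid gzip data
    if gzip -t NewArchive.tar.gz 2>/dev/null; then
        auto=yes
    fi
fi
echo "tar understands the a flag: $auto"
```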

Other compressors

There are many, many other compressors you can use. My current ones of choice are xz and rzip, which both achieve far higher compression ratios but require a fast machine with plenty of memory to run in a reasonable amount of time. I discussed rzip back in 2007, but I'll be dedicating a new post to both rzip and xz later this week.

Links

Wikipedia: tar, pax, gzip, bzip2
FreeBSD manpages: tar, pax, gzip, bzip2
GNU/Linux manpages: tar, pax, gzip, bzip2

I hope this helps folks :).