Unix file compression basics
A couple of years ago I wrote a post about rzip, and I'm still getting emails and comments from people about it. I've decided to dedicate this post to answering some of these questions so I can point people to it. Grilled cheese sandwiches contain tastyness.
For people coming from a Windows or classic Mac background, file compression on Unix-like operating systems (such as GNU/Linux and BSD) can seem a bit confusing. Unlike the ZIP format, which bundles and compresses multiple files in one go, most Unix compressors only work on one file at a time, so we first use an archiver to bundle up the files we want to compress, then feed that archive to the compressor.
Step one: archive files
Overwhelmingly the most common file archiving tool on Unix-like systems is the tape archiver (tar). This command creates a new tar archive from a folder which may contain files and other folders:
% tar cvf NewArchive.tar ./FolderToArchive/
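If you want to see what ended up inside an archive, the "t" flag lists its contents without extracting anything. Here's a rough sketch you can paste into a shell; the sample folder and file names are made up for illustration, and it works in a throwaway directory so nothing real gets touched.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Hypothetical sample data: a folder holding a file and a subfolder.
mkdir -p FolderToArchive/SubFolder
echo "hello" > FolderToArchive/a.txt
echo "world" > FolderToArchive/SubFolder/b.txt

# c = create, f = write to the named archive file.
tar cf NewArchive.tar ./FolderToArchive/

# t = list the archive's contents without extracting them.
tar tf NewArchive.tar
```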
An alternative is pax, which I prefer because it tends to be more consistent, it archives symbolic links (aka Unix shortcuts or aliases) instead of following them, and it has a few nifty features like being able to specify files that fall within a certain date range.
% pax -wf NewArchive.pax ./FolderToArchive/
Step two: compress the archive
The amount of time and CPU power you have will determine which compression algorithm you'll employ. On Unix systems the two most common are gzip, which is fast, and bzip2, which is slower but generally gets better compression ratios. Here is each of them compressing the archive we made in step one; run one or the other, because each replaces NewArchive.tar with its compressed version:
% gzip -v NewArchive.tar
% bzip2 -v NewArchive.tar
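By default gzip replaces the file you give it with the compressed version. If you'd rather keep the original around, one way is the "-c" flag, which writes the compressed data to standard output. A small sketch, again using made-up sample data in a throwaway directory:

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Hypothetical sample data: one repetitive text file, so it compresses well.
mkdir FolderToArchive
awk 'BEGIN { for (i = 0; i < 5000; i++) print "the quick brown fox" }' \
    > FolderToArchive/sample.txt
tar cf NewArchive.tar ./FolderToArchive/

# -c sends the compressed data to standard output, so NewArchive.tar
# is left in place instead of being replaced.
gzip -c NewArchive.tar > NewArchive.tar.gz

# Compare the two sizes in bytes.
wc -c NewArchive.tar NewArchive.tar.gz
```

bzip2 accepts "-c" in the same way, so the same trick should work there too.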
With both gzip and bzip2 you can adjust how much compression they perform by specifying a numeric flag from -1 (fastest) to -9 (best compression). Run either tool with --help for more information.
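To get a feel for what the numeric flags do, you can compress the same archive at both ends of the scale and compare the results. This is only a sketch with made-up sample data; how much difference you see depends heavily on what you're compressing.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Hypothetical sample data: a text file with some variation in it.
awk 'BEGIN { for (i = 0; i < 20000; i++) print "line", i, "of some sample text" }' \
    > sample.txt
tar cf NewArchive.tar sample.txt

# -1 favours speed, -9 favours size; -c keeps the original archive.
gzip -1 -c NewArchive.tar > fast.tar.gz
gzip -9 -c NewArchive.tar > best.tar.gz

# Compare the two sizes in bytes.
wc -c fast.tar.gz best.tar.gz
```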
Step three: just do one step
Now that we know the difference between archiving and compressing, we can save ourselves some time and do them both in one step. Most versions of tar support an extra flag that tells it to compress the archive with the tool of your choice as it's being made. "z" specifies gzip and "j" specifies bzip2:
% tar czvf NewArchive.tar.gz ./FolderToArchive/
% tar cjvf NewArchive.tar.bz2 ./FolderToArchive/
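If your tar doesn't have these flags, you can get much the same result by piping the two steps together yourself. A sketch, assuming a made-up sample folder in a throwaway directory:

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Hypothetical sample data.
mkdir FolderToArchive
echo "hello" > FolderToArchive/a.txt

# "f -" tells tar to write the archive to standard output, which is
# piped straight into gzip, so no intermediate .tar file is left behind.
tar cf - ./FolderToArchive/ | gzip > NewArchive.tar.gz

# Check the result: gzip -t tests integrity, tar tzf lists the contents.
gzip -t NewArchive.tar.gz
tar tzf NewArchive.tar.gz
```

The same pattern should work with any compressor that reads from standard input, which is handy for ones tar doesn't have a flag for.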
Step four: unarchiving
Basically the reverse of what we did before, and again tar can take care of it for us:
% tar xzvf NewArchive.tar.gz
% tar xjvf NewArchive.tar.bz2
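Before unpacking an archive from somewhere else, it can be worth listing it first and extracting it into a folder of its own, so it can't scatter files over your current directory. A sketch with made-up names; the "-C" flag is supported by the GNU and BSD versions of tar as far as I know, but check your man page.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Hypothetical sample data, archived and gzipped in one step.
mkdir FolderToArchive
echo "hello" > FolderToArchive/a.txt
tar czf NewArchive.tar.gz ./FolderToArchive/

# t lists the contents without extracting anything.
tar tzf NewArchive.tar.gz

# -C changes into the given directory before extracting.
mkdir Extracted
tar xzf NewArchive.tar.gz -C Extracted

# Confirm the round trip: the extracted file matches the original.
cmp FolderToArchive/a.txt Extracted/FolderToArchive/a.txt && echo "contents match"
```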
Recent versions of gnutar take this shorthand one step further with the "a" flag, which automatically determines what compressor to use when creating an archive, based on the file name you provide it; for example, NewArchive.tar.bz2 tells gnutar with the "a" flag to use bzip2. Unless you're running a very recent Linux distribution or have installed it specifically, though, you probably don't have it.
Other compressors
There are many, many other compressors you can use. My current ones of choice are xz and rzip, which both achieve far higher compression ratios but require a fast machine with plenty of memory to run in a reasonable amount of time. I discussed rzip back in 2007, but I'll be dedicating a new post to using it and xz later this week.
Links
Wikipedia | tar | pax | gzip | bzip2 |
---|---|---|---|---|
FreeBSD manpage | tar | pax | gzip | bzip2 |
GNU/Linux manpage | tar | pax | gzip | bzip2 |
I hope this helps folks :).