If you have ever unzipped a download, viewed a GIF or JPEG
file, watched an MPEG video or played an MP3 audio file, you have experienced data
compression. Data compression is arguably one of the most important aspect of the
internet/multimedia. Without data compression, 5 minute MPEG clips would take up tens of
even hundreds of megabytes, 1 minute CD-quality audio files would take up around 10
megabytes of hard drive space, game demos would take two times as long to download, and
web page images would take up to ten times longer to load! This article isn't going to
focus on all the wonderful things possible with data compression, or all the future
applications of data compression. Instead, this article will focus on explaining data
compression so that you, the reader, will know HOW it works, not just that "it
works". Also, after reading this article you can show off to your friends with
phrases like "Lempel-Ziv Compression", and astonishing trivia questions like
"Did you know that Huffman coding binary trees are 'full'?".
Lossless Compression Techniques
The first types of compression I am going to discuss are lossless compression techniques. These compression techniques allow the uncompressor to recreate the original data byte for byte. These types of compression are used predominately in file compression (PKZip, RAR, ARJ, LHA, etc.), though some of them are used in various forms of data compression because they are relatively easy to implement and very effective on certain data sets. Perhaps the most commonly used data compression algorithm is Huffman coding.
Huffman coding is based upon a simple philosophy: The most common data should be noted with the least number of bits, while the least common data should be noted with the most number of bits. After a statistical analysis of the file (or block) is done, the data is stored in a binary tree with the most common elements as close to the root as possible. Then, in order to find the coding sequence, if we traverse to the left of the tree in search for our character, we add a 0, if we traverse to the right of the tree, we add a 1. Below is an example of Huffman Compression at work.
String to Compress: "aabacbacdb"
a - 4
b - 3
c - 2
d - 1
(1) start with 4 trees with 1 node in each, then merge the two smallest trees, and insert the result into the new list, repeat.
4 4 10
3 3 6
The resulting probability tree would look like the one below: (Note that this tree is always "full" (each node has 0 or 2 children))
Now, in order to come up with the notation for each character. Traverse the tree until you hit the character you are looking for. While traversing, every time you go to the left node, mark a '0', otherwise, mark a '1'.
Using this, finding the coding pattern is trivial:
a = 1
b = 01
c = 000
d = 001
Our string above, "aabacbacdb" would convert into '1101100001100000101'. Here we have 19bits, instead of 8*10 = 80bits in the original message. Of course we also have to store the tree for uncompression (or the code for each character) but as soon as the block compressed reaches a size of 32K or so, this extra space used is negligible.
Optimizing Huffman Compression
The main method used to optimize Huffman Compression is to combine it with LZ compression (which will be discussed next). There are; however, ways to optimize Huffman coding itself. The tree algorithm shown here (the one for generating the coding for each letter) is the most optimal; however, in order for Huffman coding to be most efficient, we must have large gaps between the amount of 1 letter as opposed to the other. The ideal statistical distribution for Huffman compression with a significant number of letters present is a division by powers of 2, i.e. 1/2 of 1 letter, 1/4 of another, etc..) Since we know what kind off statistics we want, we can choose blocks which will best reflect this statistical distribution. Let's take the data below.
If we were to compress this with a block size of 16, we would not be able to compress the data that much, since all the data is distributed equally. If we decided to make our block size 4, assuming that storing the tree (or equivalent Huffman code for each character) is negligible (which it is in most real world cases) a block size of 4 would allow us to code each character in the above in 1 bit.
Now, what if a static block size won't work? Well, if we are doing Huffman compression alone, it is very feasible (albeit slow, or fast and difficult) to calculate what will be the most efficient block size dynamically. This means that every block will have a different size, the size which is most efficient. According to Tim Sweeney's, lead programmer of EPIC's kick a-- Unreal, response to an e-mail I fired off to him yesterday:
"A few simple attempts gained an additional
20% reduction in compressed bits,
but that was offset by having to store multiple Huffman tables -- so the net
result was about 0% reduction."
Tim Sweeney goes on to mention that:
"I think this could potentially be better
than Huffman alone, but I think the gain will be limited to maybe 20-30%.
Most kinds of data (such as text, graphics, program code) don't seem to have
much more local coherence than global coherence. LZH schemes will probably
beat this approach consistently."
This comes to the next point I will discuss: possible ways to work with dynamic block size when doing Huffman + LZ77 (LZH) compression, along with an explanation of what LZ(H) compression is.