All About Compression, Part I

by Ga'ash Soffer on January 7, 1999 2:57 PM EST

Lempel-Ziv (LZ) Compression

Lempel-Ziv compression takes advantage of the large amounts of repetitive data in a file. Take, for example, this article. Chances are that there are many occurrences of words such as "the", "compression", "LZ", etc. Wouldn't it be great if, each time we hit one of these common words, we could just put in a shorter code for this word? This is, in essence, what LZ compression does. Let's "fake LZ" compress the (perhaps,err, definitely nonsensical) text below:

The blah blah the blah that these blah blah the blah.

Now in our dictionary (which appears at the beginning of a LZ file) we would store the following:

The = T
blah = B
that = C
these = D

Our message above could not be compressed into something like this:

T B B T B C D B B T B

LZ compression explained in more detail

The above example is not "real" LZ compression. Real LZ compression has a dictionary (size varies depending on the max. size of an entry, and the number of entries.) The typical example used is a 4096 entry dictionary (so everything in it can be coded in 12bits) which reserves the first 255 entries for single bytes, and the remaining entries for repetitive strings.

How does LZ compression find out repetitive strings? Well, the simple form of LZW compression, which is the one I will discuss now, does not know what is a commonly used string and what is not. Here is how it works:

Read initial character into "previous" string.

While there are characters

Read in a character
If the character + the previous n (max stored string size) characters + this current character are in the dictionary,
add this character to our previous n
otherwise, output the code we made for our previous n characters, add the n previous characters + this current character to the dictionary
.

You can probably see where the compression comes into play. If you have n (i.e. lots of previous characters still in the buffer) when you hit a code in the dictionary, all those characters will be replaced by the code, which is generally around 12bits in size. Looking at this algorithm you might also notice that for small amounts of data (i.e. less than 150 bytes) LZ compression is useless. On large files with lots of repetitive data, LZ compression is very efficient. It is also exceptionally efficient at compressing graphics, since there is generally lots of repeating patterns which are easily compressed.

LZ77 Compression scheme

LZ77, another type of Lempel-Ziv compression algorithm works by looking ahead into the file, and if it sees a pattern it recognizes, it will write the previous position of that match in the file instead of the actual data. The output is then generally compressed using Huffman Coding (hence then name LZH, Lempel-Ziv Huffman) which will deal with the cases where LZ77 compression was not very successful. Here variable length Huffman encoding as mentioned last page can potentially be very helpful. LZH, or LZA (LZ + Arithmetic Coding, another compression algorithm which will be discussed in part 2) Compression are the main methods used in all the popular compressors, such as ZIP, ARJ, LHA, etc.

In part 2, Decompression will be discussed in detail, and then, in part 3 (or 4), what you are all probably waiting for, I will talk about how these compression algorithms are applied (or not applied!) in various image formats (GIF, JPG, etc.), compressed Audio, and Compressed MPEG.

Acknowledgements:

http://www.cis.ohio-state.edu/hypertext/faq/usenet/compression-faq/part2/faq-doc-1.html

Data Compression, http://www.ics.uci.edu/~dan/pubs/DC-Sec1.html

Mark Nelson's Homepage, http://dogma.net/markn/

Tim Sweeney, Lead Programmer, Epic MegaGames.

Index
Comments Locked

0 Comments

View All Comments

Log in

Don't have an account? Sign up now