LZ compression

The LZ compression family includes a large variety of lossless data compression algorithms, all derived from a very simple algorithm created by Abraham Lempel and Jacob Ziv in the late '70s, known as LZ77.

The algorithm is elegant and very dynamic, and it works very well on large input streams, producing the encoded output right away, ready to be decoded at the other end. Many websites are compressed with this algorithm, which reduces their size considerably at almost no cost in decompression time.

In this post, we are going to take a look at a new algorithm, very similar to LZSS, called LZT.

Compression

The compress() function takes two arguments: the input stream, which can have any length, and the output stream, which receives the encoded data chunks.

Input : TOBEORNOTTOBEORTOBEORNOT#
Output: TOBEORNOT[0,6][0,9]#


Steps of the algorithm:

  1. Read up to 256 bytes from the input stream.
  2. Create a new dictionary containing the positions of the longest repeated sub-strings.
  3. Write a new header containing the positions, followed by the uncompressed data.
  4. Go back to 1.

The input is read in chunks of 256 bytes. From each chunk, a new dictionary is created.
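
A minimal sketch of this top-level loop, in Perl, assuming three hypothetical helpers (find_substrings(), encode_chunk() and write_header(), sketched in the sections below):

use strict;
use warnings;

use constant CHUNK_SIZE => 256;

sub compress {
    my ($in_fh, $out_fh) = @_;

    binmode($in_fh);     # byte-oriented I/O
    binmode($out_fh);

    # Read up to 256 bytes at a time, until EOF
    while (read($in_fh, my $chunk, CHUNK_SIZE)) {
        my %dict = find_substrings($chunk);                    # step 2
        my ($positions, $literals) = encode_chunk($chunk, \%dict);
        write_header($out_fh, $positions, $literals);          # step 3
    }
}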

Example:
{
  9  => [0, 6],
  10 => [1, 5],
  11 => [2, 4],
  15 => [0, 9],
  16 => [1, 8],
  17 => [2, 7],
  18 => [3, 6],
  19 => [4, 5],
  20 => [5, 4],
}

The keys represent the start positions of the repeated sub-strings, each pointing at the position of the longest sub-string seen before, and its length. A sub-string must be at least 4 bytes long, otherwise the compression is not efficient, because the position specification would occupy more space than the raw, uncompressed sub-string.
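
A brute-force way of building this dictionary (a hypothetical sketch; the quadratic scan per position is acceptable for 256-byte chunks):

use constant MIN_LEN => 4;

sub find_substrings {
    my ($chunk) = @_;

    my %dict;
    my $len = length($chunk);

    # The first possible match starts at MIN_LEN, since the matched
    # sub-string must lie entirely before the current position.
    foreach my $i (MIN_LEN .. $len - 1) {

        my ($best_pos, $best_len) = (0, 0);

        foreach my $j (0 .. $i - 1) {

            # Extend the match while we stay inside the chunk, the
            # source stays strictly before position $i and the bytes agree
            my $k = 0;
            ++$k while ($i + $k < $len
                    and $j + $k < $i
                    and substr($chunk, $j + $k, 1) eq substr($chunk, $i + $k, 1));

            ($best_pos, $best_len) = ($j, $k) if ($k > $best_len);
        }

        $dict{$i} = [$best_pos, $best_len] if ($best_len >= MIN_LEN);
    }

    return %dict;
}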

To actually compress the input, we have to iterate over each position and check whether the current position exists in the dictionary. If it does, we store its entry inside an array and advance the current position by the length of the sub-string. When it doesn't, we just append the current byte to an $uncompressed string.
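
In Perl, this walk over the chunk might look like this (a hypothetical sketch, using the dictionary built by find_substrings() above):

sub encode_chunk {
    my ($chunk, $dict) = @_;

    my @positions;            # the [start, pos, len] triples actually used
    my $uncompressed = '';    # the literal bytes

    for (my $i = 0 ; $i < length($chunk) ; ) {
        if (exists $dict->{$i}) {
            my ($pos, $len) = @{$dict->{$i}};
            push @positions, [$i, $pos, $len];
            $i += $len;       # jump over the replaced sub-string
        }
        else {
            $uncompressed .= substr($chunk, $i, 1);
            ++$i;
        }
    }

    return (\@positions, $uncompressed);
}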

For the above input, only two entries are used: [0, 6] at position 9 and [0, 9] at position 15, respectively:
  • TOBEORNOT[0,6][0,9]#  (where [0,6] replaces TOBEOR and [0,9] replaces TOBEORNOT)

To be able to decompress the output safely, the positions have to be written in a specific format:

/----------------------------------------------------------\
| a byte representing the length of the uncompressed input |
\----------------------------------------------------------/
/----------------------------------------------------------\
| a byte representing the number of compressed sub-strings |
\----------------------------------------------------------/
/----------------------------------------------------------\
| all the sub-string position numbers converted into bytes |
\----------------------------------------------------------/
/----------------------------------------------------------\
| the uncompressed bytes                                   |
\----------------------------------------------------------/

Example:
/------------------\
|        9         |
\------------------/
/------------------\
|        2         |
\------------------/
/------------------\
| [9 0 6],[15 0 9] | 
\------------------/
/------------------\
|    TOBEORNOT#    |
\------------------/
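
A sketch of this serialization, with one caveat: the example stores 9 for the 10-byte TOBEORNOT#, so the assumption here is that the literal length is written minus one, which also lets a fully literal 256-byte chunk fit in a single byte. (The literals are never empty, because the first 4 bytes of each chunk are always literal.)

sub write_header {
    my ($out_fh, $positions, $uncompressed) = @_;

    print {$out_fh} chr(length($uncompressed) - 1);    # literal length (minus one, by assumption)
    print {$out_fh} chr(scalar @$positions);           # number of compressed sub-strings

    foreach my $triple (@$positions) {                 # each [start, pos, len] as three bytes
        print {$out_fh} join('', map { chr } @$triple);
    }

    print {$out_fh} $uncompressed;                     # the uncompressed bytes
}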

Decompression

To recreate the original data, the decompress() function reads the header specifications and starts the decompression based on this information.

A new dictionary is created:
{
   9 => [0, 6],
  15 => [0, 9],
}
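
Parsing the header back is the mirror image of the write_header() sketch above (same minus-one assumption for the literal length):

sub read_header {
    my ($in_fh) = @_;

    read($in_fh, my $head, 2) == 2 or return;    # clean EOF: no more chunks
    my ($lit_len, $count) = map { ord } split(//, $head);
    ++$lit_len;                                  # stored minus one, by assumption

    my %dict;
    foreach (1 .. $count) {
        read($in_fh, my $triple, 3) == 3 or die "truncated header";
        my ($start, $pos, $len) = map { ord } split(//, $triple);
        $dict{$start} = [$pos, $len];
    }

    read($in_fh, my $literals, $lit_len) == $lit_len or die "truncated literals";

    return (\%dict, $literals);
}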

The uncompressed data is read and stored inside a variable. Then, for each position, the dictionary is checked: if the current position exists as a key, the referenced sub-string is fetched and appended to the $decompressed string; otherwise, the next uncompressed byte is appended.

 ___________[0,9]__________
/ * * * * * * * *          \
| ____[0,6]_______         |
/ * * * * *       \        |
|                 |        |
T O B E O R N O T ^  ---   ^   ---   #
0 1 2 3 4 5 6 7 8 9 10-14 15  16-23 24
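
Putting it together, a hypothetical decode_chunk() that rebuilds the original chunk from the parsed dictionary and literals:

use List::Util qw(sum0);

sub decode_chunk {
    my ($dict, $literals) = @_;

    # Original chunk length = literal bytes + sum of the replaced lengths
    my $total = length($literals) + sum0(map { $_->[1] } values %$dict);

    my $decompressed = '';
    my $j = 0;    # next literal byte to consume

    for (my $i = 0 ; $i < $total ; ) {
        if (exists $dict->{$i}) {    # a back-reference: copy from earlier output
            my ($pos, $len) = @{$dict->{$i}};
            $decompressed .= substr($decompressed, $pos, $len);
            $i += $len;
        }
        else {                       # a literal byte
            $decompressed .= substr($literals, $j++, 1);
            ++$i;
        }
    }

    return $decompressed;
}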

More examples

Input : 'a' x 256;
Output: aaaa[0,4][0,8][0,16][0,32][0,64][0,128]

Input : 'abcdefgh' x (256 / 8);
Output: abcdefgh[0,8][0,16][0,32][0,64][0,128]

Input : 'I am Sam! Sam I am! That Sam-I-am! That Sam-I-am!'
Output: I am Sam![4,4] [0,4]! That[9,4]-I-[16,15]am!
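
With the sketches above, a round-trip can be tested over in-memory strings, using Perl's scalar filehandles (decompress() simply loops read_header() and decode_chunk() until EOF):

sub decompress {
    my ($in_fh, $out_fh) = @_;
    while (my ($dict, $literals) = read_header($in_fh)) {
        print {$out_fh} decode_chunk($dict, $literals);
    }
}

my $input = 'abcdefgh' x (256 / 8);

open(my $in_fh,  '<', \$input)   or die $!;
open(my $enc_fh, '>', \my $enc)  or die $!;
compress($in_fh, $enc_fh);
close($enc_fh);

open(my $dec_in,  '<', \$enc)    or die $!;
open(my $dec_out, '>', \my $dec) or die $!;
decompress($dec_in, $dec_out);
close($dec_out);

print(($dec eq $input) ? "round-trip ok\n" : "round-trip failed\n");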

Conclusion

The described algorithm works very well on real-time input data streams. It handles each 256-byte chunk individually, writing the encoded data to the output stream as soon as possible, ready to be decoded without depending on the previously encoded data chunks. But this comes at a price: by not depending on the previous chunks, the same token may be repeated multiple times across the encoded chunks. A more efficient approach would be to remember sub-strings from more chunks, up to 256, but, again, there is another price: this would use more memory, and the encoded chunks would depend on one another, reducing the algorithm's flexibility.

Code

Update

Actually, the described algorithm is quite rudimentary and does not achieve a high compression ratio. In order to be competitive with existing methods, like gzip, we need to encode the literals, the indices and the lengths more cleverly, using an entropy coding method, such as Huffman coding or arithmetic coding.

Several more attempts at implementing the LZ method, listed in ascending order of complexity:
