How FemtoZip Works
FemtoZip represents two natural extensions of gzip and gzip-like compression schemes. To understand it, it is easiest to first understand how gzip/DEFLATE works.
A great overview is provided here: http://www.gzip.org/deflate.html. But the short story is that gzip does two things:
- Look for repeated substrings in the input stream. If any string at least 3 characters long has already occurred in the previous 32k, it is replaced by a relative reference to the previous occurrence: <distance, length>. For example "mississippi" becomes "miss<-3,4>ppi" (the <> notation just shows the substitution; it is not expressed as ASCII). This example is a bit exotic because it shows that not all of the characters referenced by a <distance, length> pair have to exist yet, but by the time they are needed they will. This gives gzip the effect of run length encoding: 'aaaaaaaaaa' becomes 'a<-1,9>'. A small decoding sketch follows this list.
- Take the resulting stream of literals (say "miss" in the "mississippi" example), as well as the stream of distances and lengths, and Huffman encode them, so common symbols (like the letter 'e' in English) use fewer than 8 bits. The details of the Huffman coding scheme are fairly involved; in fact two Huffman trees are used, one which encodes literals and lengths, and one which encodes distances.
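The back-reference mechanism is easiest to see from the decoder's side. Below is a minimal, self-contained sketch (illustrative only, not gzip's actual code) that applies literal and <distance, length> tokens byte by byte, which is exactly what lets a reference reach past the bytes that exist when it starts, as in the 'a<-1,9>' example:

#include <iostream>
#include <string>
#include <vector>

// A token is either a literal byte or a back-reference <distance, length>.
struct Token {
    bool isLiteral;
    char literal;
    int distance; // how far back to start copying from
    int length;   // how many bytes to copy
};

// Decode LZ77-style tokens. Copying one byte at a time is what allows a
// reference to read bytes that were written earlier in the same copy,
// which is how the run length encoding effect works.
std::string decode(const std::vector<Token> &tokens) {
    std::string out;
    for (const Token &t : tokens) {
        if (t.isLiteral) {
            out += t.literal;
        } else {
            for (int i = 0; i < t.length; i++) {
                out += out[out.size() - t.distance];
            }
        }
    }
    return out;
}

int main() {
    // "miss" + <-3,4> + "ppi"  ->  "mississippi"
    std::vector<Token> mississippi = {
        {true, 'm'}, {true, 'i'}, {true, 's'}, {true, 's'},
        {false, 0, 3, 4},
        {true, 'p'}, {true, 'p'}, {true, 'i'},
    };
    std::cout << decode(mississippi) << std::endl;

    // "a" + <-1,9>  ->  "aaaaaaaaaa"
    std::vector<Token> runs = {{true, 'a'}, {false, 0, 1, 9}};
    std::cout << decode(runs) << std::endl;
    return 0;
}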
With the above background on gzip you can see why it doesn't work on small documents. First off, a repeated substring has to be seen at least twice to get any compression; for documents that are similar to each other but don't have any internal similarity, step 1 does nothing. Step 2 is a problem because a Huffman tree derived from the actual data will compress the data well, but its definition has to be written into the stream so the receiver knows how to decode. For small documents gzip will either just stick with a built-in Huffman tree (which is unlikely to be optimal for the data), or take a significant cost in storing that tree in the stream.
FemtoZip addresses both shortcomings by doing the following:
- Analyze a sample set of documents and derive a set of popular substrings. These substrings are packed into a dictionary and used to bootstrap the substring repetition detector; in effect the compressor starts off with some history to compare against. zlib actually supports starting compression/decompression off with an initial dictionary via deflateSetDictionary() and inflateSetDictionary() (see the sketch after this list), but those APIs have some overhead: for example in the deflate case, the dictionary has to be hashed before compressing each document, which is a significant performance hit for small documents, and specifying a dictionary to zlib causes a checksum of the dictionary to be written to the stream, which can be a significant cost for small documents. Also, zlib has no facility for computing a dictionary given sample data.
- FemtoZip uses a 64k dictionary, vs the historical 32k limit in gzip.
- An optimal Huffman tree is computed based on the sample documents, and is stored once in the model rather than per document.
- FemtoZip writes no header bytes or checksums to the stream.
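For comparison, here is roughly what using zlib's preset dictionary support looks like. This is a sketch of standard zlib usage (not FemtoZip code); note that the dictionary is hashed on every deflateSetDictionary() call and that the resulting stream carries a checksum of the dictionary, which is the per-document overhead described above:

#include <cstring>
#include <zlib.h>

// Deflate `input` using a preset dictionary. The decompressor must later call
// inflateSetDictionary() with the same bytes after inflate() reports Z_NEED_DICT,
// and the stream carries a checksum of the dictionary -- overhead FemtoZip avoids.
int compressWithDictionary(const unsigned char *dict, size_t dictLen,
                           const unsigned char *input, size_t inputLen,
                           unsigned char *output, size_t outputCap,
                           size_t *outputLen) {
    z_stream strm;
    std::memset(&strm, 0, sizeof(strm));
    if (deflateInit(&strm, Z_BEST_COMPRESSION) != Z_OK) {
        return -1;
    }
    // The dictionary is hashed here, and this happens per compressed document.
    if (deflateSetDictionary(&strm, dict, (uInt)dictLen) != Z_OK) {
        deflateEnd(&strm);
        return -1;
    }
    strm.next_in = (Bytef *)input;
    strm.avail_in = (uInt)inputLen;
    strm.next_out = output;
    strm.avail_out = (uInt)outputCap;
    int ret = deflate(&strm, Z_FINISH);
    *outputLen = outputCap - strm.avail_out;
    deflateEnd(&strm);
    return ret == Z_STREAM_END ? 0 : -1;
}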
The following walks through the entire FemtoZip process (building a model, compressing, and decompressing) with an example, with references to the underlying C++ code. Before reading you should complete the Quick Hands on Tutorial, and read through How FemtoZip works above. The example driver code referenced below can be found in the exampleCPlusPlus function within fztests.cpp.
In our example we seek to efficiently compress URLs. First we need to build a suitable model based on a sample set of URLs (our training documents). In reality you would use a large number of URLs (hundreds or thousands), but for our example we will train on only three:
http://espn.com
http://www.google.com
http://www.stanford.edu
Internally, the C++ library is presented with these documents via the DocumentList abstraction. DocumentList is a simple container that represents a collection of byte buffers. There are currently two interesting subclasses of DocumentList: CStringDocumentList (an array of null-terminated C strings held in memory), and FileDocumentList (an array of file paths whose contents are read on demand). The following code builds a model for these three sample documents (see exampleCPlusPlus() in fztests.cpp):
CStringDocumentList docs("http://espn.com",
                         "http://www.google.com",
                         "http://www.stanford.edu", NULL);
CompressionModel *model = new FemtoZipCompressionModel();
model->build(docs);
model->build(docs) does two things:
- Builds a well-chosen dictionary of common substrings seen in the sample documents.
- Builds a Huffman coding (really a set of codings) suitable for encoding the documents after they have undergone substring extraction.
The dictionary is built within CompressionModel::buildDictionaryIfUnspecified (CompressionModel is the base class of FemtoZipCompressionModel), which in turn relies on DictionaryOptimizer. DictionaryOptimizer concatenates all of the sample documents into a single buffer (in this case "http://espn.comhttp://www.google.comhttp://www.stanford.edu") and then generates a suffix array and a longest common prefix (lcp) table. You can read more about these data structures here. In short, for a string S of length N, the suffix array SA is an integer array of length N in which each entry SA[i] denotes the suffix of S starting at position SA[i], and SA is sorted so the suffixes are in lexicographic order. The lcp is also an integer array of length N such that lcp[i] is the number of leading characters the suffix at SA[i] has in common with the suffix at SA[i-1]. For our example data, the full suffix array, lcp, and associated suffixes are:
i SA lcp suffix
0 59 0
1 11 0 .comhttp://www.google.comhttp://www.stanford.edu
2 32 15 .comhttp://www.stanford.edu
3 55 1 .edu
4 25 1 .google.comhttp://www.stanford.edu
5 46 1 .stanford.edu
6 5 0 //espn.comhttp://www.google.comhttp://www.stanford.edu
7 20 2 //www.google.comhttp://www.stanford.edu
8 41 6 //www.stanford.edu
9 6 1 /espn.comhttp://www.google.comhttp://www.stanford.edu
10 21 1 /www.google.comhttp://www.stanford.edu
11 42 5 /www.stanford.edu
12 4 0 ://espn.comhttp://www.google.comhttp://www.stanford.edu
13 19 3 ://www.google.comhttp://www.stanford.edu
14 40 7 ://www.stanford.edu
15 49 0 anford.edu
16 12 0 comhttp://www.google.comhttp://www.stanford.edu
17 33 14 comhttp://www.stanford.edu
18 54 0 d.edu
19 57 1 du
20 31 0 e.comhttp://www.stanford.edu
21 56 1 edu
22 7 1 espn.comhttp://www.google.comhttp://www.stanford.edu
23 51 0 ford.edu
24 29 0 gle.comhttp://www.stanford.edu
25 26 1 google.comhttp://www.stanford.edu
26 0 0 http://espn.comhttp://www.google.comhttp://www.stanford.edu
27 15 7 http://www.google.comhttp://www.stanford.edu
28 36 11 http://www.stanford.edu
29 30 0 le.comhttp://www.stanford.edu
30 14 0 mhttp://www.google.comhttp://www.stanford.edu
31 35 12 mhttp://www.stanford.edu
32 10 0 n.comhttp://www.google.comhttp://www.stanford.edu
33 50 1 nford.edu
34 28 0 ogle.comhttp://www.stanford.edu
35 13 1 omhttp://www.google.comhttp://www.stanford.edu
36 34 13 omhttp://www.stanford.edu
37 27 1 oogle.comhttp://www.stanford.edu
38 52 1 ord.edu
39 3 0 p://espn.comhttp://www.google.comhttp://www.stanford.edu
40 18 4 p://www.google.comhttp://www.stanford.edu
41 39 8 p://www.stanford.edu
42 9 1 pn.comhttp://www.google.comhttp://www.stanford.edu
43 53 0 rd.edu
44 8 0 spn.comhttp://www.google.comhttp://www.stanford.edu
45 47 1 stanford.edu
46 48 0 tanford.edu
47 2 1 tp://espn.comhttp://www.google.comhttp://www.stanford.edu
48 17 5 tp://www.google.comhttp://www.stanford.edu
49 38 9 tp://www.stanford.edu
50 1 1 ttp://espn.comhttp://www.google.comhttp://www.stanford.edu
51 16 6 ttp://www.google.comhttp://www.stanford.edu
52 37 10 ttp://www.stanford.edu
53 58 0 u
54 24 0 w.google.comhttp://www.stanford.edu
55 45 2 w.stanford.edu
56 23 1 ww.google.comhttp://www.stanford.edu
57 44 3 ww.stanford.edu
58 22 2 www.google.comhttp://www.stanford.edu
59 43 4 www.stanford.edu
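For reference, the table above can be reproduced with a deliberately naive suffix array and lcp construction. This is an illustrative sketch, not DictionaryOptimizer's actual code (which presumably uses a more efficient suffix array algorithm and appears to include a sentinel entry, hence the empty suffix at position 59 in the table, so indices here will be off by one):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Concatenation of the three training documents.
    std::string s = "http://espn.com" "http://www.google.com" "http://www.stanford.edu";
    int n = (int)s.size();

    // Naive suffix array: sort suffix start positions lexicographically.
    std::vector<int> sa(n);
    for (int i = 0; i < n; i++) sa[i] = i;
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });

    // lcp[i] = number of leading characters suffixes sa[i] and sa[i-1] share.
    std::vector<int> lcp(n, 0);
    for (int i = 1; i < n; i++) {
        int a = sa[i - 1], b = sa[i], k = 0;
        while (a + k < n && b + k < n && s[a + k] == s[b + k]) k++;
        lcp[i] = k;
    }

    for (int i = 0; i < n; i++) {
        std::cout << i << "\t" << sa[i] << "\t" << lcp[i] << "\t" << s.substr(sa[i]) << "\n";
    }
    return 0;
}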
As you scan down the lcp, conceptually you can see that long, contiguous runs of large values represent substrings that are repeated in the sample input. For example we can see that entries i = [27,28] correspond to the http prefix, which occurs three times. So the algorithm for determining repeated substrings, and how often they are repeated (DictionaryOptimizer::computeSubstrings), involves scanning the lcp and tracking the change in lcp value. When the lcp increases, we know we are encountering new substrings that occur more than once in the corpus. When the lcp decreases, we know we are coming to the end of the run for those substrings. For example at i=27, we go from lcp[26] of 0 to lcp[27] of 7, so we know the prefix "http://" occurs at least twice, as do all substrings "h", "ht", "htt", "http", "http:", and "http:/". Then for i=28, we increase again to lcp[28] of 11, which means the prefix "http://www." occurs at least twice (as well as the substrings "http://w", "http://ww", and "http://www"). At i=29 the lcp drops back to 0, which means all of the previously mentioned strings can be tallied up ("http://" and all sub-variants occurred 3 times, and "http://www." and its sub-variants occurred 2 times). Each substring and its associated count is modeled as a Substring and stored in a substrings vector on the DictionaryOptimizer.
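The scan described above can be sketched as follows. This is a simplified illustration rather than the actual DictionaryOptimizer::computeSubstrings: it enumerates each repeated substring once (at the point where the lcp increases) and tallies raw occurrences with a naive rescan, whereas the real code tracks runs as it goes and, as noted below, counts unique documents:

#include <map>
#include <string>
#include <vector>

// Count (possibly overlapping) occurrences of sub in s.
static int countOccurrences(const std::string &s, const std::string &sub) {
    int count = 0;
    for (size_t pos = s.find(sub); pos != std::string::npos; pos = s.find(sub, pos + 1)) {
        count++;
    }
    return count;
}

// Enumerate every substring that occurs more than once in s, using the suffix
// array and lcp table. When lcp[i] rises above lcp[i-1], the prefixes of suffix
// sa[i] with lengths lcp[i-1]+1 .. lcp[i] are newly discovered repeats.
std::map<std::string, int> repeatedSubstrings(const std::string &s,
                                              const std::vector<int> &sa,
                                              const std::vector<int> &lcp) {
    std::map<std::string, int> counts;
    for (size_t i = 1; i < sa.size(); i++) {
        for (int len = lcp[i - 1] + 1; len <= lcp[i]; len++) {
            std::string sub = s.substr(sa[i], len);
            if (counts.find(sub) == counts.end()) {
                counts[sub] = countOccurrences(s, sub);
            }
        }
    }
    return counts;
}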
There are a few minor optimizations to this process.
- The count associated with each Substring is not the total number of occurrences in the raw buffer, but the number of unique documents the substring occurs in. Given the way substring extraction works in FemtoZip (and gzip for that matter), a string that occurs multiple times in a document gets no more benefit from inclusion in the dictionary than if it occurs once. In other words, given two documents "garrick garrick garrick toubassi" and "toubassi", the string "toubassi" is far more valuable in a shared dictionary.
- Another side effect of concatenating the documents together is that it might cause the algorithm to value strings which don't actually occur in any single document. For example, in the string "http://espn.comhttp://www.google.comhttp://www.stanford.edu" the algorithm might decide that ".comhttp://www" is a high value string, but it never occurs in any document. We exclude such strings.
- We make sure not to add clearly redundant substrings to the array at this stage. For example, if "http://" occurs 3 times and "http:/" occurs 3 times, we only need to add "http://". However, note that "http://www." gets added separately because it has a different occurrence count. The redundancy will be resolved during the "packing" phase below.
So now we have a large list of substrings and their associated counts, and we need to sort them based on how valuable they are when included in the dictionary. The value of a substring is based on how long it is and how often it occurs: "http://www." occurs twice, but "http://" occurs 3 times. When comparing these two strings we also have to account for the fact that extracting a substring doesn't remove it entirely; the substring reference still needs to be encoded, and the predicted overhead of encoding a reference is 3 bytes. So "http://" occurs 3 times but yields only (strlen("http://") - 3) net characters of savings per extraction. Furthermore, longer substrings take up more space in the dictionary and are potentially less desirable than shorter ones. For example, if the string "george washington carver" appeared twice, you might assign it a value of 2*(24-3) = 42. However, if the string "george" on its own occurred 14 times, it would have the same score, 14*(6-3) = 42, but is clearly more desirable because it takes up less space in the dictionary and thus leaves room for other strings. For that reason the overall score associated with a substring is normalized to the predicted per character savings. Thus the formula:
substring score = number of occurrences * (substring length - 3 bytes of overhead) / substring length
This intuitively makes sense because if the encoding overhead were zero instead of three, the equation would boil down to simply the number of occurrences. The set of substrings is sorted in descending order by score, so the highest value strings appear first. In our working example, the sorted substrings and their scores are:
i score string
0 171 http://
1 150 ttp://
2 145 http://www.
3 140 ttp://www.
4 133 tp://www.
5 125 p://www.
6 120 tp://
7 114 ://www.
8 100 //www.
9 80 /www.
10 75 p://
11 50 www.
12 50 .com
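The scores in the table follow directly from the formula above. Comparing the numbers, the implementation evidently scales the result by 100 and truncates to an integer (an observation from the table, not a quote from the code). A small sketch that reproduces a few of the entries:

#include <cstdio>
#include <cstring>

// Score a candidate: occurrences times the net savings per extracted character,
// assuming 3 bytes of overhead per reference, scaled by 100 and truncated.
static int score(const char *substring, int occurrences) {
    int len = (int)std::strlen(substring);
    return (int)(100.0 * occurrences * (len - 3) / len);
}

int main() {
    std::printf("%d\n", score("http://", 3));     // 171
    std::printf("%d\n", score("http://www.", 2)); // 145
    std::printf("%d\n", score(".com", 2));        // 50
    return 0;
}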
Now that we have the list of substrings sorted by desirability, we proceed with "packing" them into a dictionary (DictionaryOptimizer::pack). We iterate through the list selecting strings, comparing each against previously selected strings to make sure it isn't already covered (i.e. a substring of a previously selected string); if not, we add it and remove any previously selected strings which are themselves substrings of the new string. This is how we end up selecting only "http://www." and not also "http://". The result of this pruning process for our example is:
i score string
0 145 http://www.
1 50 .com
We then line these strings up end to end, attempting to take advantage of serendipitous commonality between neighboring strings' prefixes and suffixes (e.g. if we lay out "espn.com" next to "computer", 3 bytes are saved because it is stored as "espn.computer" instead of "espn.comcomputer"); a sketch of this overlap step appears after the dictionary below. In our case no such overlap exists, so we end up with a final dictionary of:
.comhttp://www.
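Here is a sketch of the overlap step mentioned above. It is an illustrative helper, not DictionaryOptimizer::pack itself (the real packer also chooses the layout order):

#include <algorithm>
#include <iostream>
#include <string>

// Append `next` to `packed`, reusing the longest suffix of `packed` that is
// also a prefix of `next` (e.g. "espn.com" + "computer" -> "espn.computer").
static void appendWithOverlap(std::string &packed, const std::string &next) {
    size_t maxOverlap = std::min(packed.size(), next.size());
    for (size_t k = maxOverlap; k > 0; k--) {
        if (packed.compare(packed.size() - k, k, next, 0, k) == 0) {
            packed += next.substr(k);
            return;
        }
    }
    packed += next;
}

int main() {
    std::string packed = "espn.com";
    appendWithOverlap(packed, "computer");
    std::cout << packed << std::endl; // espn.computer
    return 0;
}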
During the compression process, the above dictionary is used to extract common substrings from the input data. If we compress "http://www.shopstyle.com", we will get something like {offset:-11,length:11}shopstyle{offset:-35,length:4}. There is still plenty of work to do to translate this stream of string references and character literals into a compressed binary format. We want to take advantage of symbol frequency to encode the remaining literals, but we still need to capture and efficiently encode the substring references. This information is encoded in a symbol stream as follows:
- Literal bytes, which are represented as raw values 0-255 and encoded using a dedicated Huffman coding.
- String references, which are composed of two parts: a length, limited to 8 bits, encoded as length + 255 (to avoid ambiguity with a literal byte) using the same Huffman coding as the literals; and an offset, limited to 16 bits, encoded as 4 nibbles (4-bit words), where each nibble uses its own Huffman coding.
That's right: there are actually 5 separate Huffman codings that are calculated, one for the literal/length symbols (values from 0-512), and one for each of the 4 offset nibbles. Breaking the offset into 4 nibbles allows for a more optimal encoding because the high nibbles are often zero. Different combinations were tested empirically and this was found to be the best.
Getting back to our concrete example, the string "http://www.shopstyle.com" will be encoded as:
266 length 11 + 255, using the 'literalLength' Huffman coding
11 least significant nibble of the offset -11, encoded using the 'offsetNibble0' Huffman coding.
0 next least significant nibble of the offset -11, encoded using the 'offsetNibble1' Huffman coding.
0 next least significant nibble of the offset -11, encoded using the 'offsetNibble2' Huffman coding.
0 most significant nibble of the offset -11, encoded using the 'offsetNibble3' Huffman coding.
115 's' using the 'literalLength' Huffman coding
104 'h' using the 'literalLength' Huffman coding
111 'o' using the 'literalLength' Huffman coding
112 'p' using the 'literalLength' Huffman coding
115 's' using the 'literalLength' Huffman coding
116 't' using the 'literalLength' Huffman coding
121 'y' using the 'literalLength' Huffman coding
108 'l' using the 'literalLength' Huffman coding
101 'e' using the 'literalLength' Huffman coding
259 length 4 + 255 using the 'literalLength' Huffman coding
3 least significant nibble of the offset -35, encoded using the 'offsetNibble0' Huffman coding.
2 next least significant nibble of the offset -35, encoded using the 'offsetNibble1' Huffman coding.
0 next least significant nibble of the offset -35, encoded using the 'offsetNibble2' Huffman coding.
0 most significant nibble of the offset -35, encoded using the 'offsetNibble3' Huffman coding.
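The mapping from literals and references to symbols can be sketched as follows. The Symbol struct and helper names here are illustrative only, not FemtoZip's actual types; the sketch just reproduces the symbol stream listed above:

#include <cstdint>
#include <string>
#include <vector>

// One entry of the intermediate symbol stream: the symbol value plus which of
// the five Huffman codings it belongs to.
struct Symbol {
    int value;
    const char *coding; // "literalLength" or "offsetNibble0" .. "offsetNibble3"
};

// A literal byte is a raw 0-255 value in the literal/length coding.
static void emitLiteral(std::vector<Symbol> &out, uint8_t b) {
    out.push_back({b, "literalLength"});
}

// A reference is length+255 in the literal/length coding, then the 16-bit
// offset magnitude as four nibbles, least significant first.
static void emitReference(std::vector<Symbol> &out, int offset, int length) {
    out.push_back({length + 255, "literalLength"});
    int dist = -offset;
    out.push_back({dist & 0xf, "offsetNibble0"});
    out.push_back({(dist >> 4) & 0xf, "offsetNibble1"});
    out.push_back({(dist >> 8) & 0xf, "offsetNibble2"});
    out.push_back({(dist >> 12) & 0xf, "offsetNibble3"});
}

int main() {
    std::vector<Symbol> symbols;
    emitReference(symbols, -11, 11);                                   // 266, 11, 0, 0, 0
    for (char c : std::string("shopstyle")) emitLiteral(symbols, (uint8_t)c);
    emitReference(symbols, -35, 4);                                    // 259, 3, 2, 0, 0
    return 0;
}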
Given the above goals, we need to compute these 5 Huffman code tables. This process is driven largely by CompressionModel::buildEncodingModel. This method makes use of a SubstringPacker, which, given a dictionary and one or more strings, performs substring extraction (replacing repeated substrings with offsets/lengths) and outputs a stream of events (literal or string reference) to a SubstringPacker::Consumer. We then connect a special implementation of SubstringPacker::Consumer (FemtoZipHuffmanModelBuilder) which simply builds histograms for each symbol passed to it: literals and lengths increment the literalLength histogram, and offsets increment each of the 4 offsetNibble histograms. All training docs are fed through the SubstringPacker, so a representative sample of literals, lengths, and offsets is collected. After that each histogram is passed to an instance of FrequencyHuffmanModel, which computes the Huffman coding (the mapping of input symbols to output bits).
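A simplified sketch of the histogram-building consumer (the method names mirror the encodeLiteral/encodeSubstring events described later, but the type here is illustrative, not the actual FemtoZipHuffmanModelBuilder):

#include <cstdint>
#include <vector>

// One histogram for the literal/length symbols and one per offset nibble.
struct SymbolHistograms {
    std::vector<int> literalLength;             // symbols 0-512
    std::vector<std::vector<int>> offsetNibble; // 4 histograms of 16 buckets

    SymbolHistograms() : literalLength(513, 0), offsetNibble(4, std::vector<int>(16, 0)) {}

    // A literal byte bumps its own slot in the literal/length histogram.
    void encodeLiteral(uint8_t b) { literalLength[b]++; }

    // A reference bumps length+255 in the literal/length histogram and one
    // bucket in each of the four offset nibble histograms.
    void encodeSubstring(int offset, int length) {
        literalLength[length + 255]++;
        int dist = -offset;
        for (int i = 0; i < 4; i++) {
            offsetNibble[i][(dist >> (4 * i)) & 0xf]++;
        }
    }
};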
At this point the model is built and can be saved.
The model file is written/read using DataOutput/DataInput. At the time of this writing (3/13/12), the file contents are as follows:
- A string representing the model type, preceded by a 4 byte, little-endian length. In most cases it will be \x08\x00\x00\x00FemtoZip.
- A 4 byte file format version, currently zero (\x00\x00\x00\x00).
- The dictionary data, preceded by a 4 byte, little-endian length. In our example above it is '\x0f\x00\x00\x00.comhttp://www.'.
- A single byte (1 or 0) indicating whether the Huffman code tables follow. I can't remember why this would ever be zero, but I clearly went out of my way to support this.
- 5 FrequencyHuffmanModels, written out via FrequencyHuffmanModel::save (literalLengthModel, offsetNibble0Model, offsetNibble1Model, offsetNibble2Model, offsetNibble3Model). See the code for more details.
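As a rough illustration of the layout (not the actual DataOutput-based writer, and the output file name is hypothetical), the leading fields could be produced like this:

#include <cstdint>
#include <fstream>
#include <string>

// Write a 4 byte little-endian length followed by the raw bytes.
static void writeLengthPrefixed(std::ofstream &out, const std::string &s) {
    uint32_t len = (uint32_t)s.size();
    for (int i = 0; i < 4; i++) {
        out.put((char)((len >> (8 * i)) & 0xff));
    }
    out.write(s.data(), (std::streamsize)s.size());
}

int main() {
    std::ofstream out("example.fzm", std::ios::binary); // hypothetical file name
    writeLengthPrefixed(out, "FemtoZip");                // model type
    out.write("\x00\x00\x00\x00", 4);                    // file format version 0
    writeLengthPrefixed(out, ".comhttp://www.");         // dictionary
    out.put((char)1);                                    // Huffman code tables follow
    // ... followed by the five FrequencyHuffmanModels (see FrequencyHuffmanModel::save)
    return 0;
}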
Now that we have a model, we can compress documents. This process is done via FemtoZipCompressionModel::compress, for example:
const char *buf = "http://www.shopstyle.com";
ostringstream out;
model->compress(buf, strlen(buf), out);
string compressed = out.str();
The aforementioned SubstringPacker is used to perform string reference extraction. It is initialized with the model's dictionary and fed the target string "http://www.shopstyle.com", and it is handed the FemtoZipCompressionModel to serve as the consumer of the encodeLiteral and encodeSubstring events. The FemtoZipCompressionModel then turns around and encodes those symbols (literals directly, lengths as 255 + length, and offsets a nibble at a time) to an instance of HuffmanEncoder<FemtoZipHuffmanModel>. HuffmanEncoder uses the FemtoZipHuffmanModel to map each symbol to an instance of Codeword, and the resulting Codeword is asked to write itself to a BitOutput. FemtoZipHuffmanModel keeps track of the values that pass through it and delegates to the appropriate underlying FrequencyHuffmanModel (of the 5 used). At the end of the process the EOF symbol is encoded.
Decompressing a document proceeds in reverse, driven from FemtoZipCompressionModel::decompress, for example:
ostringstream out2;
model->decompress(compressed.c_str(), compressed.length(), out2);
string decompressed = out2.str();
First we need to decode the raw Huffman encoded bits into a symbol stream using HuffmanDecoder<FemtoZipHuffmanModel>. The HuffmanDecoder is initialized with a raw byte buffer that represents the compressed document, and by repeatedly calling HuffmanDecoder::decodeSymbol the raw symbol stream can be processed. This symbol stream is then reassembled into literals (symbols < 256) and substring references (a length indicated by a symbol > 255, followed by 4 symbols which represent the 4 nibbles of the offset). The literals and string references are fed to a SubstringUnpacker, which is initialized with the dictionary and turns {offset:-11,length:11}shopstyle{offset:-35,length:4} back into "http://www.shopstyle.com".