Space-Efficient Construction of Optimal Prefix Codes

Alistair Moffat
Department of Computer Science, The University of Melbourne, Parkville 3052, Australia.

Andrew Turpin
Department of Computer Science, The University of Melbourne, Parkville 3052, Australia.

Jyrki Katajainen
Department of Computer Science, University of Copenhagen, Universitetsparken 1, DK-2100 Copenhagen East, Denmark.


Presented at the 1995 IEEE Data Compression Conference, Snowbird, Utah, March 1995.


Large static text databases can be effectively compressed by coupling a semi-static word-based model with canonical Huffman coding. The word-based model yields relatively good compression, and the use of canonical Huffman coding on long tokens results in fast decoding rates. The high speed of the method results from the manipulation of codewords as machine-level integers, and the fact that each decoded token results in several bytes of output.
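The canonical-coding idea can be sketched in a few lines (our illustration, in Python; the actual implementation works on machine-level integers, as described above). Given only the codeword lengths, the decoder needs just three small tables per length, and each codeword is manipulated as a plain integer rather than via a tree walk:

```python
# Hypothetical sketch of canonical Huffman decoding (illustrative only;
# not the paper's implementation).

def canonical_tables(lengths):
    """Build canonical-code tables from a list of codeword lengths."""
    max_len = max(lengths)
    count = [0] * (max_len + 1)          # number of codewords of each length
    for l in lengths:
        count[l] += 1
    # canonical order: by length, ties broken by symbol number
    symbols = sorted(range(len(lengths)), key=lambda s: (lengths[s], s))
    first_code = [0] * (max_len + 1)     # smallest codeword of each length
    first_index = [0] * (max_len + 1)    # its rank in `symbols`
    code = index = 0
    for l in range(1, max_len + 1):
        first_code[l], first_index[l] = code, index
        code = (code + count[l]) << 1
        index += count[l]
    return count, first_code, first_index, symbols

def decode(bits, tables):
    """Decode a list of bits; each codeword is handled as an integer."""
    count, first_code, first_index, symbols = tables
    out, i = [], 0
    while i < len(bits):
        code, l = 0, 0
        while True:
            code, l, i = (code << 1) | bits[i], l + 1, i + 1
            # does `code` fall inside the block of length-l codewords?
            if count[l] and code - first_code[l] < count[l]:
                break
        out.append(symbols[first_index[l] + code - first_code[l]])
    return out
```

For example, with lengths $[1,2,3,3]$ the canonical codewords are 0, 10, 110, and 111, and the bit stream 0.10.110.111 decodes back to the four symbols in order; encoding is equally direct, since symbol $s$ is coded as `first_code[l] + rank`, where `rank` is its position among the length-$l$ symbols.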

There is, however, one problem with this combination, and that is the large size of the resulting set of symbols and the consequent risk of codeword overflow. For most current computers these implementation techniques are possible only if no codewords are longer than 32 bits. Moreover, 33-bit codewords can arise on an alphabet of as few as 34 distinct symbols when the total number of symbol occurrences is about 13 million~\cite[page 347]{wmb94:mg}.
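The overflow risk is easy to demonstrate (our illustration, not from the paper): on Fibonacci-distributed frequencies the Huffman tree degenerates into a path, so codeword length grows linearly with the number of distinct symbols. A plain Fibonacci distribution over 34 symbols already forces a 33-bit codeword with roughly 15 million symbol occurrences; the slightly smaller total cited above comes from a tighter distribution.

```python
# Illustration (ours): Fibonacci-like frequencies make the Huffman tree a
# long path, so the maximum codeword length grows linearly with the number
# of distinct symbols.

import heapq
import itertools

def huffman_lengths(freqs):
    """Codeword length of each symbol under Huffman's algorithm."""
    tick = itertools.count()                 # tie-breaker for the heap
    heap = [(f, next(tick), [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    depth = [0] * len(freqs)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        for s in a + b:                      # every leaf in the pair moves down
            depth[s] += 1
        heapq.heappush(heap, (w1 + w2, next(tick), a + b))
    return depth

fib = [1, 1]
while len(fib) < 34:
    fib.append(fib[-1] + fib[-2])
print(max(huffman_lengths(fib)), sum(fib))   # longest codeword, total count
```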

We have been working with a 2~Gb collection of newspaper articles, containing 350 million word appearances and nearly one million distinct words. On this {\trec} collection the longest codeword generated by the standard Huffman algorithm~\cite{huf52} is already 30 bits. Moreover, the collection has recently grown to 3~Gb, and further extension seems likely. Rather than completely redesign our implementation to allow (say) 64-bit codewords manipulated as double integers (and rather than buy new hardware), we have chosen to make use of {\em length-limited\/} codes, where a bound $L$ is placed on the maximum permitted length of any codeword---in our case, $L=32$---and then optimal codewords subject to this constraint are constructed.
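Stated formally (in our notation), the length-limited coding problem is: given symbol frequencies $f_1,\ldots,f_n$, choose integer codeword lengths to

```latex
\[
  \mbox{minimise } \sum_{i=1}^{n} f_i \ell_i
  \quad \mbox{subject to} \quad
  \sum_{i=1}^{n} 2^{-\ell_i} \le 1
  \mbox{ and } 1 \le \ell_i \le L ,
\]
```

where the first constraint, the Kraft inequality, guarantees that a prefix code with those lengths exists; dropping the upper bound $L$ recovers the problem solved exactly by Huffman's algorithm.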

Larmore and Hirschberg~\cite{lh90b} described the first computationally tractable algorithm for optimal length-limited coding in 1990. Their ``recursive package-merge'' method requires $O(nL)$ time and $O(n)$ space to calculate an optimal $L$-limited code for an alphabet of $n$ symbols. (They also described a simpler implementation that requires $O(nL)$ time and space; as will be seen below, this latter algorithm is the starting point for our development.) Unfortunately, the constant factors involved in the asymptotic analysis are higher than might be hoped, and the recursive package-merge algorithm requires $16n$ words, or about 65~Mb of memory, to calculate a code for the {\trec} word distribution. By way of contrast, Huffman's algorithm for unrestricted codes can be implemented to require just $2n$ words of memory~\cite{wmb94:mg}.
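The simpler $O(nL)$-space form of package-merge is short enough to sketch (a Python illustration under our own conventions, not the authors' implementation). Each item carries the list of leaves packaged inside it; after $L$ rounds of merging and pairing, the optimal codeword lengths are read off the first $2(n-1)$ items of the final list:

```python
# Sketch (ours) of the simple O(nL) time-and-space package-merge algorithm.

def package_merge(freqs, L):
    """Optimal codeword lengths for `freqs`, no codeword longer than L."""
    n = len(freqs)
    assert (1 << L) >= n, "no prefix code with maximum length L exists"
    leaves = sorted((f, [i]) for i, f in enumerate(freqs))
    packages = []
    for level in range(L):
        merged = sorted(leaves + packages)
        if level < L - 1:
            # pair adjacent items into packages for the next, wider level
            packages = [(merged[j][0] + merged[j + 1][0],
                         merged[j][1] + merged[j + 1][1])
                        for j in range(0, len(merged) - 1, 2)]
    lengths = [0] * n
    for _, ids in merged[:2 * (n - 1)]:      # total "width" is n - 1
        for i in ids:
            lengths[i] += 1                  # one level per containing item
    return lengths
```

For example, `package_merge([1, 1, 2, 3], 3)` reproduces the unrestricted Huffman lengths `[3, 3, 2, 1]`, while tightening the limit to $L=2$ flattens the code to `[2, 2, 2, 2]`. Retaining every leaf list is what makes this version cost $O(nL)$ space; the recursive variant, and the techniques described below, attack exactly that cost.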

Here we show that the use of the ``lazy'' list processing technique from the world of functional languages (see, for example, Holyer~\cite{holyer91}) allows, under certain conditions, the package-merge algorithm to be executed in much less space than is indicated by the $O(nL)$ worst-case space bound. For example, the revised implementation generates a 32-bit limited code for the {\trec} distribution within 15~Mb of memory. We also show how a second observation---that in large-alphabet situations it is often the case that there are many symbols with the same frequency---can be exploited to further reduce the space required, for both unrestricted and length-limited coding. This second improvement allows calculation of an optimal length-limited code for the {\trec} word distribution in under 8~Mb of memory; and calculation of an unrestricted Huffman code in under 1~Mb of memory.
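The flavour of the second observation can be conveyed in a few lines (our sketch, not the paper's data structure): once the frequency list is sorted, equal frequencies form long runs, and each run need be stored only once as a (weight, multiplicity) pair. In a large word-based alphabet most words are rare, so the low frequencies repeat many times and the run list is far shorter than the symbol list:

```python
# Sketch (ours): collapsing a sorted frequency list into (weight, count)
# runs, the representation that the space reduction exploits.

from itertools import groupby

def frequency_runs(freqs):
    """Return (weight, multiplicity) pairs for the sorted frequency list."""
    return [(w, len(list(g))) for w, g in groupby(sorted(freqs))]
```

For example, `frequency_runs([5, 1, 1, 2, 1, 2])` yields `[(1, 3), (2, 2), (5, 1)]`: six symbols, but only three distinct weights to manipulate.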