Text Compression for Dynamic Document Databases


Alistair Moffat
Department of Computer Science, The University of Melbourne, Parkville 3052, Australia.

Justin Zobel
Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia.

Neil Sharman
Department of Computer Science, The University of Melbourne, Parkville 3052, Australia.


Status

IEEE Trans. Knowledge and Data Engineering, 9(2):302-313, March/April 1997.

Abstract

For compression of text databases semi-static word-based methods provide good performance in terms of both speed and disk space, but two problems arise. First, the memory requirements for the compression model during decoding can be unacceptably high. Second, the need to handle document insertions means that the collection must be periodically recompressed if compression efficiency is to be maintained on dynamic collections. Here we show that with careful management the impact of both of these drawbacks can be kept small. Experiments with a word-based model and over 500~Mb of text show that excellent compression rates can be retained even in the presence of severe memory limitations on the decoder, and after significant expansion in the amount of stored text.