Block Sorting Text Compression


Peter Fenwick
Department of Computer Science, The University of Auckland, Private Bag 92019, Auckland, New Zealand
peter-f@cs.auckland.ac.nz


Abstract

A recent development in text compression is a "block sorting" algorithm which permutes the input text according to a special sort procedure and then processes the permuted text with Move-To-Front and a final statistical compressor. The technique is fast, with a compression performance ranking it among the best of the known compressors.

This paper describes work on the block sorting algorithm, especially establishing its relation to other compressors and attempting to improve its compression performance. It is already known to be a form of statistical compressor with unbounded contexts; we show that the contexts are so completely restructured by the sorting that many standard file compression techniques are no longer appropriate. Various approaches are investigated in an attempt to improve the compression, most of which involve a hierarchy of coding models to allow fast adaptation to local contexts. The better coding techniques include one derived from work of Shannon in 1951 in establishing the entropy of English text, while the best employs a novel model especially designed for skewed probability distributions.

The work in this paper confirms block-sorting as a viable text compression technique, with a compression approaching that of the currently best compressors while being much faster than many other compressors of comparable performance.


Conference Home Page