Parsing Strategies for BWT Compression


R. Yugo Kartono Isal
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Alistair Moffat
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.


Status

Proc. IEEE Data Compression Conference, Snowbird, Utah, March 2001, 429-438.

Abstract

Block-sorting is an innovative compression mechanism introduced in 1994 by Burrows and Wheeler. Block-sorting compression is usually described as involving three steps: permuting the input one block at a time through the use of the Burrows-Wheeler Transform (BWT); applying a Move-To-Front (MTF) transform to each of the permuted blocks; and then entropy coding the output with a Huffman or arithmetic coder. In this paper we prepend a fourth transformation to this sequence: parsing. In the BWT implementations that have been considered to date the unit of transmission has been taken to be the ASCII character. But there is no particular reason why this should be so, and a range of other strategies can be used to construct the sequence of symbols that is fed into the BWT process. We consider some of the issues associated with making this change, and show that in some situations the introduction of a simple parsing stage allows improved compression to be obtained compared to an otherwise equivalent character-based BWT implementation. We also describe an MTF-like ranking transformation that caters better to large-alphabet situations than does the strict MTF rule used in conventional BWT implementations.