Parsing Strategies for BWT Compression
R. Yugo Kartono Isal
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. IEEE Data Compression Conference,
Snowbird, Utah, March 2001, 429-438.
Abstract
Block-sorting is an innovative compression mechanism introduced in
1994 by Burrows and Wheeler.
Block-sorting compression is usually described
as involving three steps: permuting the input
one block at a time through the use of the Burrows-Wheeler Transform
(BWT); applying a Move-To-Front (MTF) transform to each of the
permuted blocks; and then entropy coding the output with a Huffman or
arithmetic coder.
In this paper we prepend a fourth transformation to this sequence:
parsing.
In the BWT implementations that have been considered to date the
unit of transmission has been taken to be the ASCII character.
But there is no particular reason why this should be so, and a range
of other strategies can be used to construct the sequence of symbols
that is fed into the BWT process.
We consider some of the issues associated with making
this change, and show that in some situations the introduction of a
simple parsing stage allows improved compression to be obtained
compared to an otherwise equivalent character-based BWT
implementation.
We also describe an MTF-like ranking transformation that
caters better to large-alphabet situations than does the strict MTF
rule used in conventional BWT implementations.