Word-Based Text Compression using the Burrows-Wheeler Transform
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
R. Yugo Kartono Isal
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Status
Information Processing and Management, 41:1175-1192, 2005.
Combines material presented in preliminary form at the
2001 IEEE Data Compression Conference (by the two current authors)
and at the 2002 Australasian Computer Science Conference (written in
collaboration with a third author, A.
C.
H.
Ngai).
Abstract
Block-sorting is an innovative compression mechanism introduced in
1994 by Burrows and Wheeler.
It involves three steps: permuting the input one block at a time
through the use of the Burrows-Wheeler Transform (BWT); applying a
Move-To-Front (MTF) transform to each of the permuted blocks; and
then entropy coding the output with a Huffman or arithmetic coder.
Until now, block-sorting implementations have assumed that the input
message is a sequence of characters.
In this paper we extend the block-sorting mechanism to word-based
models.
We also consider other recency transformations, and are able to show
improved compression results compared to MTF and uniform
arithmetic coding.
For large files of text, the combination of word-based modeling, BWT,
and MTF-like transformations allows excellent compression
effectiveness to be attained within reasonable resource costs.
Full text
http://dx.doi.org/10.1016/j.ipm.2004.08.009
.