Practical Compression of Nucleotide Databases
Hugh Williams
Department of Computer Science,
RMIT, GPO Box 2476V,
Melbourne 3001, Australia.
hugh@cs.rmit.edu.au
Justin Zobel
Department of Computer Science,
RMIT, GPO Box 2476V,
Melbourne 3001, Australia.
jz@cs.rmit.edu.au
Abstract
Nucleotide databases are collections of DNA sequences.
Search techniques for DNA databases inspect a significant
proportion of the database in response to each query, so
that efficient query evaluation requires that, if the
database is compressed,
the decompression process be extremely fast.
We investigate the entropy of
nucleotide data to set a benchmark for compression,
then consider two possible mechanisms for
compression of nucleotide databases.
One, based on Huffman coding, gives compression performance
close to the theoretical minimum but, with a decoding rate
of at most 1 megabase per second, is somewhat slow.
The other, a coding scheme devised expressly for nucleotide
data, also gives excellent compression and decodes at up to
six megabases per second.
Conference Home Page