Practical Compression of Nucleotide Databases

Hugh Williams
Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia.
hugh@cs.rmit.edu.au

Justin Zobel
Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia.
jz@cs.rmit.edu.au

Abstract

Nucleotide databases are collections of DNA sequences. Search techniques for DNA databases inspect a significant proportion of the database in response to each query, so that efficient query evaluation requires that, if the database is compressed, the decompression process be extremely fast. We investigate the entropy of nucleotide data to set a benchmark for compression, then consider two possible mechanisms for compression of nucleotide databases. One, based on Huffman coding, gives compression performance close to the theoretical minimum but, with a decoding rate of at most 1 megabase per second, is somewhat slow. The other, a coding scheme devised expressly for nucleotide data, also gives excellent compression and decodes at up to six megabases per second.

Conference Home Page