Index Compression Using 64-Bit Words
Vo Ngoc Anh
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Status
Software -- Practice & Experience, 40(2):131-147, February 2010.
Abstract
Modern computers typically make use of 64-bit words as the
fundamental unit of data access.
However the decade-long migration from 32-bit architectures has
not been reflected in compression technology, because of a widespread
assumption that effective compression techniques operate in terms of
bits or bytes, rather than words.
Here we demonstrate that the use of 64-bit access units,
especially in connection with word-bounded codes, does indeed provide
opportunity for improving compression performance.
In particular, we extend several 32-bit word-bounded coding schemes
to 64-bit operation and explore their uses in information
retrieval applications.
Our results show that the Simple-8b approach, a 64-bit
word-bounded code, is an excellent self-skipping code, and has a clear
advantage over its competitors in supporting fast query evaluation
when the data being compressed represents the inverted index for a
large text collection.
The advantages of the new code also accrue on 32-bit architectures,
and for all of Boolean, ranked, and phrase queries; which means that
it can be used in any situation.
Full text
http://dx.doi.org/10.1002/spe.948.