In-Situ Generation of Compressed Inverted Files
Alistair Moffat
Department of Computer Science,
The University of Melbourne,
Parkville 3052, Australia.
Timothy A.H. Bell
Department of Computer Science,
The University of Melbourne,
Parkville 3052, Australia.
Status
Journal of the American Society for Information Science,
46(7):537-550, August 1995.
Abstract
An inverted index stores, for each term that appears in a collection of
documents, a list of document numbers containing that term. Such an
index is indispensable when Boolean or informal ranked queries are to
be answered. Construction of the index is, however, a non-trivial
task. Simple methods using in-memory data structures cannot be used
for large collections because they require too much random access
storage, and traditional disk-based methods require large amounts of
temporary file space. This paper describes a new indexing algorithm
designed to create large compressed inverted indexes in-situ. It makes
use of simple compression codes for the positive integers, and an
in-place external multi-way mergesort. The new technique has been used
to invert a two-gigabyte text collection in under four hours, using
less than forty megabytes of temporary disk space, and less than twenty
megabytes of main memory.