In-Situ Generation of Compressed Inverted Files


Alistair Moffat
Department of Computer Science, The University of Melbourne, Parkville 3052, Australia.

Timothy A.H. Bell
Department of Computer Science, The University of Melbourne, Parkville 3052, Australia.


Status

Journal of the American Society for Information Science, 46(7):537-550, August 1995.

Abstract

An inverted index stores, for each term that appears in a collection of documents, a list of document numbers containing that term. Such an index is indispensable when Boolean or informal ranked queries are to be answered. Construction of the index is, however, a non-trivial task. Simple methods using in-memory data structures cannot be used for large collections because they require too much random access storage, and traditional disk-based methods require large amounts of temporary file space. This paper describes a new indexing algorithm designed to create large compressed inverted indexes in-situ. It makes use of simple compression codes for the positive integers, and an in-place external multi-way mergesort. The new technique has been used to invert a two-gigabyte text collection in under four hours, using less than forty megabytes of temporary disk space, and less than twenty megabytes of main memory.