Efficient Set Intersection for Inverted Indexing
J. Shane Culpepper
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Status
ACM Trans. Information Systems, 29(1):1.1-1.25, December
2010.
Abstract
Conjunctive Boolean queries are a key component of modern information
retrieval systems, especially when web-scale repositories are being
searched.
A conjunctive query q is equivalent to a |q|-way intersection
over ordered sets of integers, where each set represents the
documents containing one of the terms, and each integer in each set
is an ordinal document identifier.
As is the case with many computing applications, there is tension
between the way in which the data is represented, and the ways in
which it is to be manipulated.
In particular, the sets representing index data for typical document
collections are highly compressible, but are processed using random
access techniques, meaning that methods for carrying out set
intersections must be alert to issues to do with access patterns and
data representation.
Our purpose in this paper is to explore these tradeoffs, by
investigating intersection techniques that make use of both
uncompressed "integer" representations, as well as compressed
arrangements.
We also propose a simple hybrid method that provides both compact
storage, and also faster intersection computations for conjunctive
querying than is possible even with uncompressed representations.
Full text
http://doi.acm.org/10.1145/1877766.1877767
.