Term-Frequency Surrogates in Text Similarity Computations
Stefan Pohl
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. 13th Australasian Document Computing Symposium,
Hobart, Australia, December 2008, pages 3-10.
Abstract
Inverted indexes on external storage perform best when accesses are
ordered and data is read sequentially, so that seek times are
minimized.
As a consequence, the various items required to compute Boolean,
ranked and phrase queries are often interleaved in the inverted
lists.
While suitable for query types in which all items are required, this
arrangement has the drawback that other query types -- notably pure
ranked queries and conjunctive Boolean queries -- do not require
access to word position information, and that component of each
posting must be bypassed when these queries are being handled.
In this paper we show that the term frequency component of each
posting can be completely replaced by a surrogate that allows
skipping of positional information interleaved in inverted lists, and
obtain significant speedups in ranked query execution without
increasing the index size, and without harming retrieval
effectiveness.
We also explore two methods of reconstituting approximations to the
original term frequencies that can be employed if use of the
surrogate is deemed too risky.
Our simple improvement can thus be used with all ranking functions
that make use of term frequencies.