Term-Frequency Surrogates in Text Similarity Computations


Stefan Pohl
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Alistair Moffat
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.


Status

Proc. 13th Australasian Document Computing Symposium, Hobart, Australia, December 2008, pages 3-10.

Abstract

Inverted indexes on external storage perform best when accesses are ordered and data is read sequentially, so that seek times are minimized. As a consequence, the various items required to compute Boolean, ranked and phrase queries are often interleaved in the inverted lists. While suitable for query types in which all items are required, this arrangement has the drawback that other query types -- notably pure ranked queries and conjunctive Boolean queries -- do not require access to word position information, and that component of each posting must be bypassed when these queries are being handled. In this paper we show that the term frequency component of each posting can be completely replaced by a surrogate that allows skipping of positional information interleaved in inverted lists, and obtain significant speedups in ranked query execution without increasing the index size, and without harming retrieval effectiveness. We also explore two methods of reconstituting approximations to the original term frequencies that can be employed if use of the surrogate is deemed too risky. Our simple improvement can thus be used with all ranking functions that make use of term frequencies.

Full text