On the Cost of Extracting Proximity Features for Term-Dependency Models


Xiaolu Lu
School of Computer Science and Information Technology, RMIT University, Victoria 3001, Australia.

Alistair Moffat
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Shane Culpepper
School of Computer Science and Information Technology, RMIT University, Victoria 3001, Australia.


Status

Proc. 24th ACM CIKM Int. Conf. on Information and Knowledge Management, Melbourne, October 2015, pages 293-302.

Abstract

Sophisticated ranking mechanisms make use of term dependency features in order to compute similarity scores for documents. These features include exact phrase occurrences and term proximity estimates, in both cases building on the intuition that if multiple query terms appear near each other, the document has an increased probability of being relevant to the query. In this paper we examine the processes that are used to compute these statistics. Two distinct input structures can be used -- inverted files in which term positional information is retained, and the "direct" files constructed by some systems, in which each document is represented by a sequence of preprocessed term identifiers. Based on these two input modalities, a number of algorithms can be employed to compute proximity statistics. Until now, these algorithms have been described in terms of a single set of query terms. But similarity computations such as the Full Dependency Model require the computation of proximity statistics for a collection of related term sets. We present a new approach in which such collections are processed holistically, in time that is much less than would be the case if each subquery were evaluated independently. The benefits of the new method are demonstrated by a comprehensive experimental study.
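
As a rough illustration of the per-document work such a proximity feature entails (not the algorithm of the paper), the following Python sketch counts the minimal position spans of width at most w that contain every term of one subquery at least once, given the per-term position lists that either a positional inverted index or a direct-file scan could supply. The function name, the parameter w, and this particular window definition are illustrative assumptions.

    def count_unordered_windows(positions, w):
        """Count minimal spans of width at most w that contain every
        query term at least once.  'positions' maps each term of one
        subquery to its sorted list of positions within a single
        document.  Illustrative sketch only."""
        # Merge the per-term position lists into one sorted stream,
        # tagging each position with the term it belongs to.
        tagged = sorted((p, t) for t, plist in positions.items() for p in plist)
        need = len(positions)   # number of distinct terms to cover
        have = {}               # term -> occurrences inside the current window
        left = 0
        count = 0
        for right, (pos_r, term_r) in enumerate(tagged):
            have[term_r] = have.get(term_r, 0) + 1
            # Drop redundant occurrences from the left so the window's
            # left endpoint is essential (removing it would break coverage).
            while have[tagged[left][1]] > 1:
                have[tagged[left][1]] -= 1
                left += 1
            # The span is minimal only if the right endpoint is also
            # essential, i.e. its term occurs exactly once in the window.
            if len(have) == need and have[term_r] == 1:
                if pos_r - tagged[left][0] + 1 <= w:
                    count += 1
        return count

    # Example: one document's positions for a two-term subquery.
    print(count_unordered_windows(
        {"information": [3, 17, 42], "retrieval": [4, 40]}, 8))   # -> 2

Doing this independently for every subquery of a Full Dependency Model query repeats much of the merging and scanning work, which is the redundancy the holistic approach described above is designed to avoid.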

Full text

http://dx.doi.org/10.1145/2806416.2806606.