On the Cost of Extracting Proximity Features for Term-Dependency Models
Xiaolu Lu
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Alistair Moffat
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.
Shane Culpepper
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Status
Proc. 24th ACM CIKM Int. Conf. on Information and
Knowledge Management,
Melbourne, October 2015, pages 293-302.
Abstract
Sophisticated ranking mechanisms make use of term dependency features
in order to compute similarity scores for documents.
These features include exact phrase occurrences and term proximity
estimates, in both cases building on the intuition that if multiple
query terms appear near each other, the document has an increased
probability of being relevant to the query.
In this paper we examine the processes that are used to compute these
statistics.
Two distinct input structures can be used -- inverted files in which
term positional information is retained, and the "direct" files
constructed by some systems, in which each document is represented by
a sequence of preprocessed term identifiers.
Based on these two input modalities, a number of algorithms can be
employed to compute proximity statistics.
Until now, these algorithms have been described in terms of a single
set of query terms.
But similarity computations such as the Full Dependency Model involve
computation of proximity statistics for a collection of related term
sets.
We present a new approach in which such collections are processed
holistically, in time that is much less than would be required if each
subquery were evaluated independently.
The benefits of the new method are demonstrated by a comprehensive
experimental study.
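As a rough illustration of the kind of proximity statistic the abstract
refers to, the sketch below counts unordered windows over a "direct"
file representation, in which a document is simply a sequence of term
identifiers. This is a simplified stand-in written for this note, not
the paper's algorithm; the function name and the particular window
definition (every width-w sliding window containing all query terms)
are assumptions for illustration only.

```python
def count_unordered_windows(doc, terms, w):
    """Count sliding windows of width w in doc (a list of term ids)
    that contain every term in `terms`.

    A simplified stand-in for the unordered-window proximity feature
    used in term-dependency models such as the Full Dependency Model;
    real systems use more refined window semantics and faster
    algorithms than this quadratic scan.
    """
    terms = set(terms)
    count = 0
    # Slide a fixed-width window across the term-id sequence.
    for i in range(len(doc) - w + 1):
        if terms.issubset(doc[i:i + w]):
            count += 1
    return count
```

A Full Dependency Model evaluation would compute such counts for many
related term subsets of the query, which is why the paper's holistic
processing of the whole collection of subqueries pays off.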
Full text
http://dx.doi.org/10.1145/2806416.2806606.