Efficient and Effective Higher Order Proximity Modeling


Xiaolu Lu
School of Computer Science and Information Technology, RMIT University, Victoria 3001, Australia.

Alistair Moffat
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Shane Culpepper
School of Computer Science and Information Technology, RMIT University, Victoria 3001, Australia.


Status

Proc. ACM Int. Conf. on Theory of Information Retrieval, Newark, Delaware, September 2016, pages 21-30.

Abstract

Bag-of-words retrieval models are widely used, and provide a robust trade-off between efficiency and effectiveness. These models often make simplifying assumptions about relations between query terms, and treat term statistics independently. However, query terms are rarely independent, and previous work has repeatedly shown that term dependencies can be critical to improving the effectiveness of ranked retrieval results. Among all term-dependency models, the Markov Random Field (MRF) [Metzler and Croft, SIGIR, 2005] model has received the most attention in recent years. Despite clear effectiveness improvements, these models are not deployed in performance-critical applications because of the potentially high computational costs. As a result, bigram models are generally considered to be the best compromise between full term dependence, and term-independent models such as BM25.

Here we provide further evidence that term-dependency features not captured by bag-of-words models can reliably improve retrieval effectiveness. We also present a new variation on the highly-effective MRF model that relies on a BM25-derived potential. The benefit of this approach is that it is built from feature functions which require no higher-order global statistics. We empirically show that our new model reduces retrieval costs by up to 60%, with no loss in effectiveness compared to previous approaches.


Full text

http://dx.doi.org/10.1145/2970398.2970404.