Exploring the Magic of WAND
Matthias Petri
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia
and
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.
J. Shane Culpepper
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Alistair Moffat
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. 18th Australasian Document Computing Symp.,
Brisbane, December 2013, pages 58-65.
Abstract
Web search services process thousands of queries per second, and
filter their answers from collections containing very large amounts
of data.
Fast response to queries is a critical service expectation.
The well-known WAND processing strategy is one way of reducing the
amount of computation necessary when executing such a query.
The value of WAND has now been validated in a wide range of
studies, and has become one of the key baselines against which all
new top-k processing algorithms are benchmarked.
However, most previous implementations of WAND-based retrieval
approaches have been in the context of the BM25 Okapi similarity
scoring regime.
Here we measure the performance of WAND in the context of the
alternative Language Model similarity score computation, and find
that the dramatic efficiency gains reported in previous studies are
no longer achievable.
That is, when the primary goal of a retrieval system is to maximize
effectiveness, WAND is relatively unhelpful in terms of attaining
the secondary objective of maximizing query throughput rates.
However, the BlockMax-WAND algorithm does in fact help reducing the
percentage of postings to be scored, but with additional
computational overhead.
We explore a variety of trade-offs between scoring metric and
processing regime and present new insight into how score-safe
algorithms interact with rank scoring.
Errata
Algorithm 1 contains a small error, at line 17 it should say:
17:
score_limit ← tmp_s_lim
Indeed, we could have saved a line of pseudo-code by deleting line
17 completely, and changing these two lines:
13:
score_limit ← score_limit + U[pivot]
14:
if score_limit > θ then
since score_limit doesn't get used in any way once the loop ends.
Oh well, only took eight years to discover this! (16 March 2022).
Full text
http://dx.doi.org/10.1145/2537734.2537744