Assessing Efficiency-Effectiveness Tradeoffs in Multi-Stage
Retrieval Systems Without Using Relevance Judgments
Charles L. A. Clarke
School of Computer Science,
University of Waterloo
Ontario N2L 3G1, Canada
Shane Culpepper
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Alistair Moffat
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.
Status
Information Retrieval J., 19(4):416-445, 2016.
Abstract
Large-scale retrieval systems are often implemented as a cascading
sequence of phases -- a first filtering step, in which a large set of
candidate documents is extracted using a simple technique such as
Boolean matching and/or static document scores; and then one or more
ranking steps, in which the pool of documents retrieved by the filter
is scored more precisely using dozens or perhaps hundreds of
different features.
The documents returned to the user are then taken from the head of
the final ranked list.
Here we examine methods for measuring the quality of filtering and
preliminary ranking stages, and show how to use these measurements to
tune the overall performance of the system.
Standard top-weighted metrics used for overall system evaluation are
not appropriate for assessing filtering stages, since the output is a
set of documents, rather than an ordered sequence of documents.
Instead, we use an approach in which a quality score is computed
based on the discrepancy between filtered and full evaluation.
Unlike previous approaches, our methods do not require relevance
judgments, and thus can be used with virtually any query set.
We show that this quality score directly correlates with actual
differences in measured effectiveness when relevance judgments are
available.
Since the quality score does not require relevance judgments, it can
be used to identify queries that perform particularly poorly for a
given filter.
Using these methods, we explore a wide range of filtering options
using thousands of queries, categorize the relative merits of the
different approaches, and identify useful parameter combinations.
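As a rough illustration of the idea of scoring a filter by its discrepancy from full evaluation, the sketch below compares the top of a full ranking with the set of documents that survive a filter.
It is not the quality measure used in the paper: the geometric rank weighting, the parameter p, and the names top_k and filter_discrepancy are all assumptions made for this example.

```python
from typing import Callable, Iterable, List, Set


def top_k(docs: Iterable[str], score: Callable[[str], float], k: int) -> List[str]:
    """Rank docs by descending score and keep the first k (full evaluation)."""
    return sorted(docs, key=score, reverse=True)[:k]


def filter_discrepancy(full_ranking: List[str], filtered: Set[str], p: float = 0.8) -> float:
    """Top-weighted share of the full ranking that the filter discarded.

    Illustrative stand-in only: rank i (0-based) carries weight (1 - p) * p**i,
    so losing a highly ranked document costs more than losing a deep one.
    Returns 0.0 when nothing is lost and 1.0 when everything is lost.
    No relevance judgments are consulted.
    """
    weights = [(1 - p) * p ** i for i in range(len(full_ranking))]
    total = sum(weights)
    if total == 0:
        return 0.0
    lost = sum(w for w, d in zip(weights, full_ranking) if d not in filtered)
    return lost / total


# Hypothetical scores produced by the expensive final-stage ranker.
scores = {"d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.1}
full = top_k(scores, scores.get, 3)

# A hypothetical filter that dropped d2 before the ranking stage.
partial = filter_discrepancy(full, {"d1", "d3", "d4"})
```

Under this sketch, queries with a high discrepancy are the ones where the filter removed documents that full evaluation would have ranked near the top, which is the kind of per-query signal the abstract describes.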
Full text
http://dx.doi.org/10.1007/s10791-016-9279-1.
(Author version (PDF)).