The Effect of Pooling and Evaluation Depth on IR Metrics
Xiaolu Lu
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Alistair Moffat
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.
Shane Culpepper
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Status
Information Retrieval J., 19(4):416-445, 2016.
Abstract
Batch IR evaluations are usually performed in a framework that
consists of a document collection, a set of queries, a set of
relevance judgments, and one or more effectiveness metrics.
A large number of evaluation metrics have been proposed, and two
primary families have emerged: recall-based metrics and
utility-based metrics.
In both families, the pragmatics of forming judgments mean that it is
usual to evaluate the metric to some chosen depth such as k=20
or k=100, without necessarily fully considering the
ramifications associated with that choice.
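To make the distinction between the two families concrete, the
following minimal Python sketch (our illustration, not code from the
paper) computes precision@k, a utility-based metric that depends only
on the top k documents returned, and average precision truncated at
depth k, a recall-based metric that is additionally normalized by R,
the total number of relevant documents. The names run (a ranked list
of document ids) and qrels (a mapping from document id to binary
relevance) are assumptions made for this example.

def precision_at_k(run, qrels, k):
    # Utility-based: depends only on the top k documents retrieved.
    rel = sum(1 for doc in run[:k] if qrels.get(doc, 0) > 0)
    return rel / k

def average_precision_at_k(run, qrels, k):
    # Recall-based: normalized by R, the total number of relevant
    # documents, so relevant documents that are unjudged or ranked
    # below depth k still depress the score when evaluation is
    # truncated.
    R = sum(1 for rel in qrels.values() if rel > 0)
    if R == 0:
        return 0.0
    score = 0.0
    rel_seen = 0
    for i, doc in enumerate(run[:k], start=1):
        if qrels.get(doc, 0) > 0:
            rel_seen += 1
            score += rel_seen / i
    return score / R

The division by R is what couples a recall-based score at depth k to
judgments that lie beyond depth k, and hence to the pool depth used
when the judgments were formed.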
Our aim in this paper is to explore the relative risks arising with
fixed-depth evaluation in the two families, and document the complex
interplay between metric evaluation depth and judgment pooling depth.
Using a range of TREC resources including NewsWire data and the
ClueWeb collection, we: (1) examine the implications of finite
pooling on the subsequent usefulness of different test collections,
including specifying options for truncated evaluation; and (2)
determine the extent to which various metrics correlate with
themselves when computed to different evaluation depths using those
judgments.
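The self-correlation analysis in (2) can be sketched as follows; this
is a hypothetical illustration of the general methodology rather than
the paper's experimental code. It ranks a set of systems by mean
score at two evaluation depths and compares the two orderings with
Kendall's tau, one of several rank correlation coefficients that
might be used. The names systems (mapping a system name to its
per-topic runs), qrels_by_topic, and metric (any scoring function,
such as precision_at_k above) are assumptions.

from scipy.stats import kendalltau

def system_ranking(systems, qrels_by_topic, metric, k):
    # Mean metric score per system across topics, at depth k.
    means = {}
    for name, runs in systems.items():
        scores = [metric(runs[t], qrels_by_topic[t], k) for t in runs]
        means[name] = sum(scores) / len(scores)
    # System names sorted from best to worst.
    return sorted(means, key=means.get, reverse=True)

def depth_self_correlation(systems, qrels_by_topic, metric, k1, k2):
    order1 = system_ranking(systems, qrels_by_topic, metric, k1)
    order2 = system_ranking(systems, qrels_by_topic, metric, k2)
    # Express the second ordering as ranks relative to the first,
    # then correlate the two rank sequences.
    pos2 = {name: i for i, name in enumerate(order2)}
    tau, _ = kendalltau(range(len(order1)),
                        [pos2[name] for name in order1])
    return tau

A tau near 1.0 indicates that the metric induces much the same system
ordering at both depths; lower values indicate volatility with
respect to the choice of evaluation depth.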
We demonstrate that the judgment pools constructed for the ClueWeb
collections lack resilience, and are suited primarily to the
application of top-heavy utility-based metrics rather than
recall-based metrics. We also show that, on the majority of the
established test collections and across a range of evaluation depths,
recall-based metrics tend to generate more volatile system rankings
than utility-based metrics do.
That is, experimentation using utility-based metrics is more robust
to choices such as the evaluation depth employed than is
experimentation using recall-based metrics.
This distinction should be noted by researchers as they plan and
execute system-versus-system retrieval experiments.
Full text
http://dx.doi.org/10.1007/s10791-016-9282-6