Observed Volatility in Effectiveness Metrics
Xiaolu Lu
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Alistair Moffat
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.
Shane Culpepper
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Status
Proc. SIGIR RIGOR Workshop on Reproducibility, Inexplicability,
and Generalizability of Results,
Santiago, August 2015, to appear.
Abstract
Information retrieval research and commercial search system
evaluation both rely heavily on batch evaluation and on numerical
system comparisons using effectiveness metrics.
Batch evaluation provides a relatively low-cost alternative to user
studies, and permits repeatable and incrementally varying
experimentation in research situations in which access to high-volume
query/click streams is not possible.
As a result, the IR community has invested considerable effort into
formulating, justifying, comparing, and contrasting a large number of
alternative metrics.
In this paper we consider a very simple question: to what extent can
the various metrics be said to give rise to stable scores; that is,
convergent evaluations in which adding further relevance information
yields refined score estimates rather than simply different ones.
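As a concrete illustration of the convergence being asked about, the
following Python sketch (not taken from the paper) computes rank-biased
precision (RBP), a metric that reports an explicit residual, as relevance
judgments arrive one at a time; the run, the judgment stream, and the
persistence parameter p = 0.8 are invented for the example.

    def rbp(run, qrels, p=0.8):
        # RBP base score plus residual: each unjudged document
        # contributes its full rank weight to the residual, so the
        # true score lies in [base, base + residual].
        base = residual = 0.0
        for i, doc in enumerate(run):
            weight = (1 - p) * p ** i
            if doc not in qrels:
                residual += weight      # unjudged: score may still rise
            elif qrels[doc] > 0:
                base += weight          # judged relevant
        return base, residual

    run = ["d3", "d1", "d7", "d2", "d9"]            # hypothetical ranking
    stream = [("d1", 1), ("d3", 0), ("d7", 1), ("d2", 0), ("d9", 1)]

    qrels = {}
    for doc, rel in stream:                         # add judgments one by one
        qrels[doc] = rel
        base, resid = rbp(run, qrels)
        print("%d judged: RBP in [%.3f, %.3f]"
              % (len(qrels), base, base + resid))

Under RBP the score intervals are nested by construction, so each new
judgment refines the estimate; the question posed here is, in effect,
whether the point scores of other metrics behave comparably as judgment
pools grow.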
Underlying this question is a fundamental concern, namely, whether
the numeric behavior of metrics provides confidence that comparative
system evaluations based on the metrics are robust and
defensible.
Full paper (PDF).