Observed Volatility in Effectiveness Metrics
Xiaolu Lu
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Alistair Moffat
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.
Shane Culpepper
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Status
Proc. SIGIR RIGOR Workshop on Reproducibility, Inexplicability,
and Generalizability of Results,
Santiago, August 2015, to appear.
Abstract
Information retrieval research and commercial search system
evaluation both rely heavily on batch evaluation and on numerical
system comparisons using effectiveness metrics.
Batch evaluation provides a relatively low-cost alternative to user
studies, and permits repeatable and incrementally varying
experimentation in research situations in which access to high-volume
query/click streams is not possible.
As a result, the IR community has invested considerable effort into
formulating, justifying, comparing, and contrasting a large number of
alternative metrics.
In this paper we consider a very simple question: to what extent can
the various metrics be said to give rise to stable scores; that is,
convergent evaluations in which adding further relevance information
yields refined score estimates rather than simply different ones.
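As a concrete illustration of the convergence being asked about, the
following Python sketch (not taken from the paper) computes rank-biased
precision (RBP), a metric that reports an explicit residual, as relevance
judgments arrive one at a time; the run, the judgment stream, and the
persistence parameter p = 0.8 are invented for the example.

    def rbp(run, qrels, p=0.8):
        # RBP base score plus residual: each unjudged document
        # contributes its full rank weight to the residual, so the
        # true score lies in [base, base + residual].
        base = residual = 0.0
        for i, doc in enumerate(run):
            weight = (1 - p) * p ** i
            if doc not in qrels:
                residual += weight      # unjudged: score may still rise
            elif qrels[doc] > 0:
                base += weight          # judged relevant
        return base, residual

    run = ["d3", "d1", "d7", "d2", "d9"]            # hypothetical ranking
    stream = [("d1", 1), ("d3", 0), ("d7", 1), ("d2", 0), ("d9", 1)]

    qrels = {}
    for doc, rel in stream:                         # add judgments one by one
        qrels[doc] = rel
        base, resid = rbp(run, qrels)
        print("%d judged: RBP in [%.3f, %.3f]"
              % (len(qrels), base, base + resid))

Under RBP the score intervals are nested by construction, so each new
judgment refines the estimate; the question posed here is, in effect,
whether the point scores of other metrics behave comparably as judgment
pools grow.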
Underlying this question is a fundamental concern, namely, whether
the numeric behavior of metrics provides confidence that comparative
system evaluations based on the metrics are robust and
defensible.
Full paper (PDF).