The Effect of Pooling and Evaluation Depth on Metric Stability

William Webber
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Alistair Moffat
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Justin Zobel
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Status

Proc. 3rd International Workshop on Evaluating Information Access, Tokyo, Japan, June 2010, pages 7-15.

Abstract

The profusion of information retrieval effectiveness metrics has inspired the development of meta-evaluative criteria for choosing between them. One such criterion is discriminative power; that is, the proportion of system pairs whose difference in effectiveness is found statistically significant. Studies of discriminative power frequently find normalized discounted cumulative gain (nDCG) to be the most discriminative metric, but there has been no satisfactory explanation of which feature makes it so discriminative. In this paper, we examine the discriminative power of nDCG and several other metrics under different evaluation and pooling depths, and with different forms of score normalization. We find that evaluation depth is more important to metric behaviour and discriminative power than metric type; that evaluating beyond pooling depth does not seem to lead to a misleading system reinforcement effect; and that nDCG does seem to have a genuine, albeit slight, edge in discriminative power under a range of conditions.
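As an illustration of the two quantities the abstract relies on, the following is a minimal sketch of nDCG at a given evaluation depth and of discriminative power as a proportion of significantly different system pairs. It is not the paper's implementation. The function names, the log2 discount, the 0.05 threshold, and the exact sign test are assumptions made for this example; the abstract does not state which significance test or nDCG variant the authors used.

```python
import math
from itertools import combinations

def dcg(gains, k):
    """Discounted cumulative gain over the first k gains (log2 discount, assumed)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg(gains, all_relevant_gains, k):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ordering at depth k."""
    ideal = dcg(sorted(all_relevant_gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

def sign_test_p(xs, ys):
    """Two-sided exact sign test on paired per-topic scores (ties dropped).

    Assumption: chosen only because it needs no third-party library.
    """
    wins = sum(x > y for x, y in zip(xs, ys))
    losses = sum(x < y for x, y in zip(xs, ys))
    n = wins + losses
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(wins, losses) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def discriminative_power(scores, alpha=0.05):
    """Fraction of system pairs whose score difference is significant at alpha.

    scores maps a system name to its list of per-topic metric scores,
    with topics in the same order for every system.
    """
    pairs = list(combinations(scores, 2))
    significant = sum(sign_test_p(scores[a], scores[b]) < alpha for a, b in pairs)
    return significant / len(pairs)
```

Changing k in ndcg corresponds to changing the evaluation depth; recomputing discriminative_power on the resulting per-topic scores shows how the proportion of separable system pairs moves with depth for a given metric.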

Full text

http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings8/EVIA/03-EVIA2010-WebberW.pdf