The Effect of Pooling and Evaluation Depth on Metric Stability
William Webber
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Justin Zobel
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. 3rd International Workshop on Evaluating Information Access,
Tokyo, Japan,
June 2010, pages 7-15.
Abstract
The profusion of information retrieval effectiveness metrics has
inspired the development of meta-evaluative criteria for choosing
between them.
One such criterion is discriminative power; that is, the proportion
of system pairs whose difference in effectiveness is found
statistically significant.
Studies of discriminative power frequently find normalized discounted
cumulative gain (nDCG) to be the most discriminative metric, but
there has been no satisfactory explanation of which feature makes it
so discriminative.
In this paper, we examine the discriminative power of nDCG and
several other metrics under different evaluation and pooling depths,
and with different forms of score normalization.
We find that evaluation depth is more important to metric behaviour
and discriminative power than metric type; that evaluating beyond
pooling depth does not seem to lead to a misleading system
reinforcement effect; and that nDCG does seem to have a genuine,
albeit slight, edge in discriminative power under a range of
conditions.
Full text
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings8/EVIA/03-EVIA2010-WebberW.pdf