Score Standardization for Robust Comparison of Retrieval Systems


William Webber
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Alistair Moffat
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Justin Zobel
NICTA Victoria Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.


Status

Proc. 12th Australasian Document Computing Symposium, Melbourne, Australia, December 2007, pages 1-8.

Abstract

Information retrieval systems are evaluated by applying them to standard test collections of documents, topics, and relevance judgements. An evaluation metric is then used to score a system's output for each topic; these scores are averaged to obtain an overall measure of effectiveness. However, different topics have differing degrees of difficulty and differing variability in scores, leading to inconsistent contributions to aggregate system scores and problems in comparing scores between different test collections. In this paper, we propose that per-topic scores be standardized on the observed score distributions of the runs submitted to the original experiment from which the test collection was created. We demonstrate that standardization equalizes topic contributions to system effectiveness scores and improves inter-collection comparability.
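The standardization described in the abstract is, in essence, a per-topic z-score transformation: each raw topic score is shifted by the mean and scaled by the standard deviation of the scores that the originally submitted runs achieved on that topic, so that every topic contributes to the averaged system score on a comparable scale. The following Python sketch illustrates the idea only; the function names and the data layout (a mapping from topic to the list of reference run scores) are illustrative assumptions, not the paper's implementation.

    from statistics import mean, stdev

    def standardize(raw_scores, reference_scores):
        """Standardize per-topic scores against reference score distributions.

        raw_scores:       dict mapping topic -> this system's raw metric score
        reference_scores: dict mapping topic -> list of scores achieved by the
                          reference (originally submitted) runs on that topic
        Returns a dict mapping topic -> standardized score.
        """
        standardized = {}
        for topic, score in raw_scores.items():
            ref = reference_scores[topic]
            mu, sigma = mean(ref), stdev(ref)
            # Shift by the topic mean and scale by the topic standard
            # deviation, so hard and easy topics weigh in comparably.
            standardized[topic] = (score - mu) / sigma if sigma > 0 else 0.0
        return standardized

    def mean_standardized_score(raw_scores, reference_scores):
        """Average the standardized per-topic scores into one system score."""
        return mean(standardize(raw_scores, reference_scores).values())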

Full text


Errata

In Column 2 of Page 6 we state: "In fact, due to a programming oversight, the two subsets were in fact constructed so that they were disjoint." Subsequent experiments have determined, however, that non-disjoint sampling without replacement for each topic subset, as was intended in the paper, is actually the distorting sampling method, and that either disjoint sampling (that is, partitioning, in which all topics are used) or non-disjoint sampling with replacement should be used instead.
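For concreteness, the three sampling schemes contrasted in this erratum can be sketched as below. This is a minimal illustration under assumed function names and parameters, not the code used in the experiments.

    import random

    def disjoint_partition(topics, k):
        """Disjoint sampling (partitioning): shuffle the topics and split them
        into k equal-sized subsets, so every topic is used exactly once."""
        shuffled = random.sample(topics, len(topics))
        size = len(topics) // k
        return [shuffled[i * size:(i + 1) * size] for i in range(k)]

    def sample_with_replacement(topics, size):
        """Non-disjoint sampling with replacement: each subset is drawn
        independently, and a topic may appear more than once in a subset."""
        return [random.choice(topics) for _ in range(size)]

    def sample_without_replacement(topics, size):
        """Non-disjoint sampling without replacement (per subset): no topic
        repeats within a subset, but different subsets may overlap. This is
        the method the erratum identifies as distorting."""
        return random.sample(topics, size)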