Score Standardization for Robust Comparison of Retrieval Systems
William Webber
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Justin Zobel
NICTA Victoria Laboratory,
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. 12th Australasian Document Computing Symposium,
Melbourne, Australia, December 2007, pages 1-8.
Abstract
Information retrieval systems are evaluated by applying them to
standard test collections of documents, topics, and relevance
judgements.
An evaluation metric is then used to score a system's output for each
topic; these scores are averaged to obtain an overall measure of
effectiveness.
However, different topics have differing degrees of difficulty and
differing variability in scores, leading to inconsistent
contributions to aggregate system scores and problems in comparing
scores between different test collections.
In this paper, we propose that per-topic scores be standardized
on the observed score distributions of the runs submitted to the
original experiment from which the test collection was created.
We demonstrate that standardization
equalizes topic contributions to system effectiveness
scores and improves inter-collection comparability.
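As a rough illustration of the idea (not the exact procedure used in the paper), the following Python sketch standardizes each topic's scores against the observed mean and standard deviation of a matrix of run scores, so that every topic contributes on the same scale to a system's average. The data and run names here are hypothetical.

    import numpy as np

    def standardize_scores(scores):
        """Standardize a topic-by-system matrix of effectiveness scores.

        scores: 2-D array; rows are topics, columns are systems (runs).
        Each topic's row is shifted and scaled to zero mean and unit
        variance, so topics of differing difficulty and variability
        contribute equally to aggregate system scores.
        """
        scores = np.asarray(scores, dtype=float)
        topic_means = scores.mean(axis=1, keepdims=True)
        topic_stds = scores.std(axis=1, keepdims=True)
        return (scores - topic_means) / topic_stds

    # Hypothetical example: three topics of differing difficulty, four runs.
    raw = np.array([
        [0.10, 0.15, 0.05, 0.20],   # a hard topic: all scores low
        [0.70, 0.80, 0.60, 0.90],   # an easy topic: all scores high
        [0.40, 0.45, 0.35, 0.50],   # a middling topic
    ])
    z = standardize_scores(raw)
    print(z.mean(axis=0))           # per-system mean of standardized scores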
Errata
In Column 2 of Page 6, we state:
To explore [the rate of false positives] [...], we took
all 50 of the TREC5 Adhoc Track topics and randomly
sampled two subsets of 25 topics each.
The two subsets were not required to be disjoint; a disjoint
partitioning would distort the extreme results, since if
one subset happened to get all the hardest topics, then
necessarily the other subset would get all the easiest ones.
Due to a programming oversight, however, the two subsets were in fact
constructed so that they were disjoint.
Subsequent experiments have determined that non-disjoint sampling
without replacement for each topic subset, as was intended in the
paper, is in reality the sampling method that distorts the results,
and that either disjoint sampling (that is, partitioning, in which all
topics are used) or non-disjoint sampling with replacement should be
used instead.
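To make the distinction between the three sampling schemes concrete, here is a hypothetical Python sketch; the function names and the topic numbering are illustrative only and do not reproduce the experiments themselves.

    import random

    def disjoint_partition(topics, k):
        """Shuffle the topic pool and split it into two disjoint
        subsets of size k (a partition when 2*k covers the pool)."""
        pool = random.sample(topics, len(topics))
        return pool[:k], pool[k:2 * k]

    def nondisjoint_without_replacement(topics, k):
        """Draw each subset independently, without replacement within
        a subset; the two subsets may overlap."""
        return random.sample(topics, k), random.sample(topics, k)

    def nondisjoint_with_replacement(topics, k):
        """Draw each subset independently, with replacement
        (bootstrap-style)."""
        return ([random.choice(topics) for _ in range(k)],
                [random.choice(topics) for _ in range(k)])

    topics = list(range(1, 51))   # e.g. 50 TREC-5 Adhoc topic numbers
    a, b = disjoint_partition(topics, 25)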