Score Standardization for Robust Comparison of Retrieval Systems
William Webber
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Justin Zobel
NICTA Victoria Laboratory,
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. 12th Australasian Document Computing Symposium,
Melbourne, Australia, December 2007, pages 1-8.
Abstract
Information retrieval systems are evaluated by applying them to
standard test collections of documents, topics, and relevance
judgements.
An evaluation metric is then used to score a system's output for each
topic; these scores are averaged to obtain an overall measure of
effectiveness.
However, different topics have differing degrees of difficulty and
differing variability in scores, leading to inconsistent
contributions to aggregate system scores and problems in comparing
scores between different test collections.
In this paper, we propose that per-topic scores be standardized
on the observed score distributions of the runs submitted to the
original experiment from which the test collection was created.
We demonstrate that standardization
equalizes topic contributions to system effectiveness
scores and improves inter-collection comparability.
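As a rough illustration of the idea (not the exact procedure used in the paper), the following Python sketch standardizes each topic's scores against the observed mean and standard deviation of a matrix of run scores, so that every topic contributes on the same scale to a system's average. The data and run names here are hypothetical.

    import numpy as np

    def standardize_scores(scores):
        """Standardize a topic-by-system matrix of effectiveness scores.

        scores: 2-D array; rows are topics, columns are systems (runs).
        Each topic's row is shifted and scaled to zero mean and unit
        variance, so topics of differing difficulty and variability
        contribute equally to aggregate system scores.
        """
        scores = np.asarray(scores, dtype=float)
        topic_means = scores.mean(axis=1, keepdims=True)
        topic_stds = scores.std(axis=1, keepdims=True)
        return (scores - topic_means) / topic_stds

    # Hypothetical example: three topics of differing difficulty, four runs.
    raw = np.array([
        [0.10, 0.15, 0.05, 0.20],   # a hard topic: all scores low
        [0.70, 0.80, 0.60, 0.90],   # an easy topic: all scores high
        [0.40, 0.45, 0.35, 0.50],   # a middling topic
    ])
    z = standardize_scores(raw)
    print(z.mean(axis=0))           # per-system mean of standardized scores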
Errata
In Column 2 of Page 6, we state:
To explore [the rate of false positives] [...], we took
all 50 of the TREC5 Adhoc Track topics and randomly
sampled two subsets of 25 topics each.
The two subsets were not required to be disjoint; a disjoint
partitioning would distort the extreme results, since if
one subset happened to get all the hardest topics, then
necessarily the other subset would get all the easiest ones.
Due to a programming oversight, however, the two subsets were in fact
constructed so that they were disjoint.
Subsequent experiments have determined that non-disjoint sampling
without replacement for each topic subset, as was intended in the
paper, is in reality the sampling method that distorts the results,
and that either disjoint sampling (that is, partitioning, in which all
topics are used) or non-disjoint sampling with replacement should be
used instead.
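To make the distinction between the three sampling schemes concrete, here is a hypothetical Python sketch; the function names and the topic numbering are illustrative only and do not reproduce the experiments themselves.

    import random

    def disjoint_partition(topics, k):
        """Shuffle the topic pool and split it into two disjoint
        subsets of size k (a partition when 2*k covers the pool)."""
        pool = random.sample(topics, len(topics))
        return pool[:k], pool[k:2 * k]

    def nondisjoint_without_replacement(topics, k):
        """Draw each subset independently, without replacement within
        a subset; the two subsets may overlap."""
        return random.sample(topics, k), random.sample(topics, k)

    def nondisjoint_with_replacement(topics, k):
        """Draw each subset independently, with replacement
        (bootstrap-style)."""
        return ([random.choice(topics) for _ in range(k)],
                [random.choice(topics) for _ in range(k)])

    topics = list(range(1, 51))   # e.g. 50 TREC-5 Adhoc topic numbers
    a, b = disjoint_partition(topics, 25)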