Score Aggregation Techniques in Retrieval Experimentation
Sri Devi Ravana and Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. Australasian Database Conference,
Wellington, New Zealand, January 2009, pages 59-67,
Volume 92 of Conferences in Research and Practice in
Information Technology.
Abstract
Comparative evaluations of information retrieval systems rest on
a number of key premises, including that representative topic sets
can be created, that suitable relevance judgements can be generated,
and that systems can be sensibly compared based on their aggregate
performance over the selected topic set.
This paper considers the role of the third of these premises:
that the performance of a system on a set of topics can be
represented by a single overall performance score such as the
average, or some other central statistic.
In particular, we experiment with score aggregation techniques
including the arithmetic mean, the geometric mean, the harmonic mean,
and the median.
Using past TREC runs, we show that an adjusted geometric mean provides
more consistent system rankings than the arithmetic mean when a
significant fraction of the individual topic scores are close to
zero, and that score standardization (Webber et al., SIGIR 2008)
achieves the same outcome more reliably still.
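For concreteness, the aggregation statistics at issue can be sketched
as follows; the notation (per-topic scores s_1, ..., s_n, adjustment
constant epsilon, and per-topic statistics mu_i, sigma_i) is assumed
here for illustration rather than taken from the paper. Given the
scores of a system over n topics,

\[
\mathrm{AM} = \frac{1}{n}\sum_{i=1}^{n} s_i, \qquad
\mathrm{HM} = \frac{n}{\sum_{i=1}^{n} 1/s_i}, \qquad
\mathrm{GM}_{\varepsilon} =
  \exp\!\left(\frac{1}{n}\sum_{i=1}^{n}\ln(s_i+\varepsilon)\right)
  - \varepsilon,
\]

where the adjusted geometric mean shifts every score by a small
constant epsilon > 0 so that zero or near-zero topic scores do not
force the aggregate to zero; TREC's GMAP measure applies the same
style of adjustment, with the exact constant an implementation choice.
Score standardization instead re-expresses each raw score as a
topic-relative z-score,

\[
z_i = \frac{s_i - \mu_i}{\sigma_i},
\]

where mu_i and sigma_i are the mean and standard deviation of the
scores recorded for topic i across a reference pool of systems,
before the per-system average is taken.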