Estimating System Effectiveness Scores With Incomplete Evidence
Sri Devi Ravana
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. 15th Australasian Document Computing Symp.,
Melbourne, December 2010, pages 20-27.
Abstract
It is common for only partial relevance judgments to be used when
comparing retrieval system effectiveness, in order to control
experimental cost.
Using TREC data, we consider the uncertainty introduced into
per-topic effectiveness scores by pooled judgments, and measure the
effect that incomplete evidence has on both the system scores that
are generated and the quality of paired system comparisons.
We measure system behavior from three different points of view: the
trend in effectiveness scores; the separability of system pairs; and
the number of reversals in significance outcomes as the depth of
judgments increases.
Our results show that when shallow pooled judgments are used, system
separability remains relatively high, but there is also a high
rate of significance reversal.
We then show that explicitly adjusting effectiveness scores to allow
for the known amount of uncertainty gives a reduced number of
reversals, and hence more consistent experimental outcomes.