Statistical Power in Retrieval Experimentation
William Webber
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Justin Zobel
NICTA Victoria Laboratory,
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. 17th ACM Conference on
Information and Knowledge Management (CIKM 2008), Napa Valley, CA,
October 2008, pages 571-580.
Abstract
The power of a statistical test is the probability that it will detect
a true effect of a given size; fixing a target power in turn determines
the sample size required.
In IR evaluation, power corresponds to the number of topics that are
likely to be sufficient to detect a certain degree of superiority of
one system over another. To predict the power of a test, one must
estimate the variability of the population being sampled from; here,
that of between-system score deltas. This paper demonstrates that basing
such an estimate either on previous experience or on trial
experiments leaves wide margins of error. Iteratively adding more
topics to the test set until power is achieved is more efficient;
however, we show that it leads to a bias in favour
of finding both power and significance. A hybrid methodology
is proposed, and the reporting requirements of the experimenter
using this methodology are laid out.
We also demonstrate that greater statistical power is achieved for
the same relevance assessment effort by evaluating a large number of
topics shallowly rather than a small number deeply.
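As an illustrative sketch (not the paper's own method), the standard normal-approximation power calculation shows how the required number of topics depends on the assumed standard deviation of between-system score deltas; the function name and the numeric figures below are hypothetical:

```python
import math
from statistics import NormalDist

def topics_needed(delta, sigma, alpha=0.05, power=0.8):
    """Normal-approximation sample size for a two-sided paired test:
    the number of topics needed to detect a mean between-system score
    delta of `delta`, when deltas have standard deviation `sigma`."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = nd.inv_cdf(power)           # quantile for target power
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# Assumed figures for illustration: detecting a 0.02 MAP delta when
# per-topic deltas have standard deviation 0.08, at 80% power.
print(topics_needed(0.02, 0.08))  # -> 126 topics
```

Because the requirement grows with the square of sigma/delta, even a modest error in the variability estimate shifts the predicted topic count substantially, which is the margin-of-error problem the abstract describes.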
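The bias from adding topics until significance is reached can also be sketched with a small simulation (a hedged illustration, not the paper's experiment): with a true effect of zero, re-testing after every batch of added topics and stopping at the first p < 0.05 rejects far more often than the nominal 5% level. The batch sizes, unit-variance deltas, and known-variance z-test are all simplifying assumptions made for this sketch.

```python
import math
import random
from statistics import NormalDist

def p_value(deltas):
    """Two-sided z-test p-value for mean delta = 0, assuming the
    deltas have unit variance (a simplification for the sketch)."""
    n = len(deltas)
    z = (sum(deltas) / n) * math.sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def sequential_trial(start=10, step=5, max_n=100, alpha=0.05, rng=random):
    """Add topics in batches, testing after each batch, and stop as
    soon as p < alpha -- the iterative procedure the abstract warns
    about. The true effect is zero, so any rejection is spurious."""
    deltas = [rng.gauss(0, 1) for _ in range(start)]
    while True:
        if p_value(deltas) < alpha:
            return True   # false positive
        if len(deltas) >= max_n:
            return False
        deltas.extend(rng.gauss(0, 1) for _ in range(step))

rng = random.Random(42)
trials = 2000
rate = sum(sequential_trial(rng=rng) for _ in range(trials)) / trials
print(rate)  # well above the nominal 0.05 false-positive rate
```

Each trial gets many chances to cross the significance threshold by luck, so the overall false-positive rate is inflated; this is the bias that motivates the hybrid methodology and reporting requirements described above.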
Full paper
http://doi.acm.org/10.1145/1458082.1458158