Improvements That Don't Add Up: Ad-Hoc Retrieval Results Since 1998
Timothy G. Armstrong, Alistair Moffat, William Webber, and Justin Zobel
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. 18th ACM Int. Conf. on Information and Knowledge
Management (CIKM 2009), Hong Kong,
November 2009, pages 601-610.
Abstract
The existence and use of standard test collections in information
retrieval experimentation allows results to be compared between
research groups and over time.
Such comparisons, however, are rarely made. Most
researchers report results only from their own experiments,
a practice that allows a lack of overall improvement to go unnoticed.
In this paper, we analyze results achieved on the TREC
Ad-Hoc, Web, Terabyte, and Robust collections as reported in SIGIR
(1998--2008) and CIKM (2004--2008).
Dozens of individual published experiments report effectiveness
improvements, and often claim statistical significance.
However, there is little evidence of improvement in ad-hoc
retrieval technology over the past decade.
Baselines are generally weak, often falling below the median of the
original TREC systems, and only a handful of experiments exceed the
score of the best TREC automatic run.
Given this finding, we question the value of achieving even a
statistically significant result over a weak baseline.
We propose that the community adopt a practice of regular
longitudinal comparison to ensure measurable progress, or at least
prevent the lack of it from going unnoticed. We describe
an online database of retrieval runs that facilitates such a practice.
Full text
http://doi.acm.org/10.1145/1645953.1646031