Relative Significance is Insufficient: Baselines Matter Too
Timothy G. Armstrong
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Justin Zobel
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
William Webber
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. SIGIR Workshop on the Future of IR Evaluation, Boston, July 2009.
Workshop presentation.
Now available as a full paper.
Abstract
We have tabulated retrieval effectiveness claims from a large number
of information retrieval research papers published from 1998 to 2008, a period
that has seen many innovations. The results of our analysis are
not encouraging. Over this period, although a great many papers
claimed significant effectiveness improvements, there has been no
overall gain in absolute retrieval effectiveness on TREC ad hoc collections.
A decade of development has not, it appears, led to better
systems.
To promote verifiable improvement, reporting practices that allow
rigorous comparison with prior results are needed. We propose
several measures: ongoing longitudinal surveys; better reporting of
baselines and use of standard systems; and use of resources such as
our evaluatIR.org, an accessible database of test results.
Full text
http://staff.science.uva.nl/~kamps/ireval/papers/paper_2.pdf