Relative Significance is Insufficient: Baselines Matter Too


Timothy G. Armstrong
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Justin Zobel
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

William Webber
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Alistair Moffat
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.


Status

Proc. SIGIR Workshop on the Future of IR Evaluation, Boston, July 2009. Workshop presentation. Now available as a full paper.

Abstract

We have tabulated retrieval effectiveness claims from a large number of information retrieval research papers from 1998–2008, a period that has seen many innovations. The results of our analysis are not encouraging. Over this period, although a great many papers claimed significant effectiveness improvements, there has been no overall gain in absolute retrieval effectiveness on TREC ad hoc collections. A decade of development has not, it appears, led to better systems. To promote verifiable improvement, reporting practices that allow rigorous comparison with prior results are needed. We propose several measures: ongoing longitudinal surveys; better reporting of baselines and use of standard systems; and use of resources such as our evaluatIR.org, an accessible database of test results.

Full text

http://staff.science.uva.nl/~kamps/ireval/papers/paper_2.pdf