Models and Metrics: IR Evaluation as a User Process

Alistair Moffat
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Falk Scholer
School of Computer Science and Information Technology, RMIT University, Victoria 3001, Australia.

Paul Thomas
ICT Centre, CSIRO, Canberra, Australia.


Proc. 17th Australasian Document Computing Symp., Dunedin, New Zealand, December 2012, pages 47-54.


Retrieval system effectiveness can be measured in two quite different ways: by monitoring the behavior of users and gathering data about the ease and accuracy with which they accomplish certain specified information-seeking tasks; or by using numeric effectiveness metrics to score system runs with reference to a set of relevance judgments. The former has the benefit of directly assessing the actual goal of the system, namely the user's ability to complete a search task; the latter has the benefit of being quantitative and repeatable. Each effectiveness metric is an attempt to bridge the gap between these two evaluation approaches, since the implicit belief supporting the use of any particular metric is that user task performance should be correlated with the numeric score provided by the metric. In this work we explore that linkage, considering a range of effectiveness metrics and the user search behavior that each of them implies. We then examine more complex user models, as a guide to the development of new effectiveness metrics. We conclude by summarizing an experiment that we believe will help establish the strength of the linkage between models and metrics.



Erratum: the equation for L_{AP} at the top of the fourth page (page 50 in the printed proceedings) is incorrect. The correct formulation of AP, using the described framework of W, C, and L, is:
	C_{AP}(i) = \frac{\sum_{j=i+1}^{\infty} (r_j/j)}{\sum_{j=i}^{\infty} (r_j/j)}
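To make the corrected formula concrete, here is a minimal sketch in Python (not from the paper): it computes C_{AP}(i) as the ratio of the two tail sums, approximating the infinite sums by truncating at the end of a finite relevance vector. The function name `c_ap` and the use of a finite list `rels` are illustrative assumptions.

```python
def c_ap(rels, i):
    """Continuation probability C_AP(i) under the AP user model.

    rels is a finite list of relevance values r_1, r_2, ..., used as a
    truncated approximation of the infinite ranking; i is a 1-based rank.
    """
    # Numerator: residual weight strictly below rank i (j >= i + 1).
    num = sum(r / j for j, r in enumerate(rels, start=1) if j >= i + 1)
    # Denominator: residual weight at rank i and below (j >= i).
    den = sum(r / j for j, r in enumerate(rels, start=1) if j >= i)
    return num / den if den else 0.0
```

For example, with binary relevance vector [1, 0, 1], the continuation probability at rank 1 is (0/2 + 1/3) / (1/1 + 0/2 + 1/3) = 0.25.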
We came one step closer in CIKM 2013, but note that we did not quite get it right there either.