User Variability and IR System Evaluation


Peter Bailey
Microsoft, Australia.

Alistair Moffat
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Falk Scholer
School of Computer Science and Information Technology, RMIT University, Victoria 3001, Australia.

Paul Thomas
CSIRO and the Australian National University, Canberra, Australia.


Status

Proc. 38th Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, Santiago, August 2015, pages 625-634.

Abstract

Test collection design eliminates sources of user variability to make statistical comparisons among information retrieval (IR) systems more affordable. Does this choice unnecessarily limit the generalizability of the outcomes to real usage scenarios? We explore two aspects of user variability with regard to evaluating the relative performance of IR systems, assessing effectiveness in the context of a subset of topics from three TREC collections, with the embodied information needs categorized against three levels of increasing task complexity. First, we explore the impact of widely differing queries that searchers construct for the same information need description. By executing those queries, we demonstrate that query formulation is critical to query effectiveness. The results also show that the range of scores characterizing effectiveness for a single system arising from these queries is comparable to, or greater than, the range of scores arising from variation among systems using only a single query per topic. Second, our experiments reveal that searchers display substantial individual variation in the number of documents they anticipate needing and the number of queries they anticipate issuing, with significant underlying differences in these numbers across the increasing task complexity levels. Our conclusion is that test collection design would be improved by the use of multiple query variations per topic, and could be further improved by the use of metrics that are sensitive to the expected numbers of useful documents.
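To make the within-system versus between-system comparison concrete, here is a minimal Python sketch contrasting the two score ranges the abstract describes. The effectiveness values, variable names, and function are hypothetical illustrations of the idea, not data or code from the paper.

# Illustrative sketch only: compare the spread of effectiveness scores for
# one system across query variations of a topic against the spread across
# systems when each runs a single query. All numbers here are hypothetical.

def score_range(scores):
    """Return the spread (max minus min) of a list of effectiveness scores."""
    return max(scores) - min(scores)

# Hypothetical scores for one system running five query variations
# formulated by different searchers for the same information need.
one_system_many_queries = [0.12, 0.31, 0.08, 0.44, 0.27]

# Hypothetical scores for five systems each running the single canonical
# query for that topic (the usual test-collection setup).
many_systems_one_query = [0.22, 0.25, 0.19, 0.28, 0.24]

within = score_range(one_system_many_queries)   # variation from query formulation
between = score_range(many_systems_one_query)   # variation from system differences

print(f"within-system range (query variation): {within:.2f}")
print(f"between-system range (single query):   {between:.2f}")
# The paper's observation is that the first range can match or exceed the second.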

Full text

http://dx.doi.org/10.1145/2766462.2767728

Errata

Hmmm, looks like we were careless when preparing our final version: in March 2020 it was noticed that the sentence "The experiment was reviewed and approved by the <anonymous institution> ethics board" on page 4 should instead read "The experiment was reviewed and approved by the CSIRO ethics board". Ooops.