User Variability and IR System Evaluation
Peter Bailey
Microsoft, Australia
Alistair Moffat
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.
Falk Scholer
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Paul Thomas
CSIRO and Australian National University,
Canberra, Australia.
Status
Proc. 38th Ann. Int. ACM SIGIR Conf. on
Research and Development in Information Retrieval,
Santiago, August 2015, pages 625-634.
Abstract
Test collection design eliminates sources of user variability to make
statistical comparisons among information retrieval (IR) systems more
affordable.
Does this choice unnecessarily limit generalizability of the outcomes
to real usage scenarios?
We explore two aspects of user variability with regard to evaluating
the relative performance of IR systems, assessing effectiveness in
the context of a subset of topics from three TREC collections, with
the embodied information needs categorized against three levels of
increasing task complexity.
First, we explore the impact of widely differing queries that
searchers construct for the same information need description.
By executing those queries, we demonstrate that query formulation is
critical to query effectiveness.
The results also show that the range of scores characterizing
effectiveness for a single system arising from these queries is
comparable to, or greater than, the range of scores arising from
variation among systems when only a single query is used per topic.
Second, our experiments reveal that searchers display substantial
individual variation in the number of documents they anticipate
needing and the number of queries they expect to issue, and that
these numbers differ significantly across the increasing task
complexity levels.
Our conclusion is that test collection design would be improved by
the use of multiple query variations per topic, and could be further
improved by the use of metrics that are sensitive to the expected
numbers of useful documents.
Full text
http://dx.doi.org/10.1145/2766462.2767728
Errata
Hmmm, it looks like we were careless when preparing our final version:
in March 2020 the sentence "The experiment was reviewed and approved
by the <anonymous institution> ethics board" was noticed on page 4;
it should read "The experiment was reviewed and approved by the CSIRO
ethics board".
Ooops.