User Variability and IR System Evaluation
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.
School of Computer Science and Information Technology, Victoria 3001, Australia.
CSIRO and Australian National University.
Proc. 38th Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, Santiago, August 2015, pages 625–634.
Test collection design eliminates sources of user variability to make
statistical comparisons among information retrieval (IR) systems more
affordable. Does this choice unnecessarily limit the generalizability
of the outcomes to real usage scenarios?
We explore two aspects of user variability with regard to evaluating
the relative performance of IR systems, assessing effectiveness in
the context of a subset of topics from three TREC collections, with
the embodied information needs categorized against three levels of
increasing task complexity.
First, we explore the impact of widely differing queries that
searchers construct for the same information need description.
By executing those queries, we demonstrate that query formulation is
critical to query effectiveness.
The results also show that the range of scores characterizing
effectiveness for a single system arising from these queries is
comparable to or greater than the range of scores arising from variation
among systems using only a single query per topic.
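As a concrete illustration of this comparison (all scores, system names, and query labels below are invented for the example, and are not drawn from the paper's experiments), the two ranges can be computed directly from per-topic effectiveness scores:

```python
# Hypothetical sketch: score range across systems (one query per topic)
# versus score range across query variations (one system). All values
# are invented for illustration.

# Effectiveness scores (e.g., average precision) for a single topic,
# one fixed query, several systems:
scores_across_systems = {"sysA": 0.42, "sysB": 0.35, "sysC": 0.51}

# The same topic, one fixed system, several searcher-formulated queries:
scores_across_queries = {"q1": 0.18, "q2": 0.44, "q3": 0.61, "q4": 0.29}

def score_range(scores):
    """Spread of effectiveness scores: maximum minus minimum."""
    values = list(scores.values())
    return max(values) - min(values)

system_range = score_range(scores_across_systems)  # variation among systems
query_range = score_range(scores_across_queries)   # variation among queries

print(f"range across systems:          {system_range:.2f}")
print(f"range across query variations: {query_range:.2f}")
```

With numbers like these, the spread attributable to query formulation exceeds the spread attributable to system choice, which is the pattern the abstract describes.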
Second, our experiments reveal that searchers display substantial
individual variation in the number of documents they anticipate needing
to examine and the number of queries they anticipate needing to issue,
and that there are underlying significant differences in these numbers
in line with increasing task complexity.
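A minimal sketch of how such differences might be tested appears below, assuming a nonparametric comparison across three complexity levels; the choice of test and all counts are illustrative assumptions, not the paper's reported procedure or data:

```python
# Hypothetical sketch: testing whether searchers' anticipated effort
# differs across task complexity levels. The Kruskal-Wallis test and all
# data values are assumptions made for illustration only.
from scipy.stats import kruskal

# Anticipated number of documents needed, per searcher, for topics at
# three levels of increasing task complexity (invented data).
level_1 = [1, 2, 1, 3, 2, 1]     # simplest tasks
level_2 = [3, 5, 4, 6, 4, 5]
level_3 = [8, 12, 7, 10, 9, 11]  # most complex tasks

statistic, p_value = kruskal(level_1, level_2, level_3)
print(f"H = {statistic:.2f}, p = {p_value:.4f}")
```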
Our conclusion is that test collection design would be improved by
the use of multiple query variations per topic, and could be further
improved by the use of metrics that are sensitive to the expected
numbers of useful documents.
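One way a metric could be made sensitive to the anticipated number of documents is sketched below using rank-biased precision (RBP, Moffat and Zobel, 2008); RBP is not prescribed by this paper, and the mapping from anticipated documents to RBP's persistence parameter is an assumption made here for illustration:

```python
# Hypothetical sketch: an effectiveness metric whose user model is tuned
# to the number of documents a searcher anticipates needing.

def rbp(relevance, p):
    """Rank-biased precision: (1 - p) * sum_i rel_i * p^(i-1),
    with ranks i = 1, 2, ... and persistence parameter p."""
    return (1 - p) * sum(rel * p ** i for i, rel in enumerate(relevance))

def persistence_for_expected_depth(expected_docs):
    """Under RBP the expected number of documents examined is 1/(1-p),
    so a searcher anticipating E documents implies p = 1 - 1/E.
    This mapping is an illustrative assumption, not the paper's proposal."""
    return 1.0 - 1.0 / max(expected_docs, 1.0)

# Binary relevance of the top-ranked documents for some run (invented data).
run = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

# A simple task (few documents anticipated) and a complex task (many
# anticipated) yield different scores for the same ranking.
for expected in (2, 10):
    p = persistence_for_expected_depth(expected)
    print(f"expected docs = {expected:2d}  p = {p:.2f}  RBP = {rbp(run, p):.3f}")
```

The design point is that the same ranked list is scored differently depending on how many useful documents the searcher anticipates, which is the kind of sensitivity the conclusion calls for.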