Incorporating User Expectations and Behavior into the
Measurement of Search Effectiveness
Alistair Moffat
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.
Peter Bailey
Microsoft Research, Canberra, Australia.
Falk Scholer
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Paul Thomas
Microsoft Research, Canberra, Australia.
Status
ACM Trans. Information Systems, 35(3):24:1-24:38, June 2017.
Abstract
Information retrieval systems aim to help users satisfy information
needs.
We argue that the goal of the person using the system, and the
pattern of behavior that they exhibit as they proceed to attain that
goal, should be incorporated into the methods and techniques used to
evaluate the effectiveness of IR systems, so that the resulting
effectiveness scores have a useful interpretation that corresponds to
the users' search experience.
In particular, we investigate the role of search task complexity, and
show that it has a direct bearing on the number of relevant answer
documents sought by users in response to an information need,
suggesting that useful effectiveness metrics must be goal
sensitive.
We further suggest that user behavior while scanning results listings
is affected by the rate at which their goal is being realized, and
hence that appropriate effectiveness metrics must be
adaptive to the presence (or not) of relevant documents in
the ranking.
In response to these two observations, we present a new effectiveness
metric, INST, that has both of the desired properties: INST
employs a parameter T, a direct measure of the user's search goal
that adjusts the top-weightedness of the evaluation score; moreover,
as progress towards the target T is made, the modeled user
behavior is adapted, to reflect the remaining expectations.
INST is experimentally compared to previous effectiveness metrics,
including Average Precision (AP), Normalized Discounted Cumulative
Gain (NDCG), and Rank-Biased Precision (RBP), demonstrating our
claims as to INST's usefulness.
Like RBP, INST is a weighted-precision metric, meaning that
each score can be accompanied by a residual that quantifies
the extent of the score uncertainty caused by unjudged documents.
As part of our experimentation, we use crowd-sourced data and score
residuals to demonstrate that a wide range of queries arise for even
quite specific information needs, and that these variant queries
introduce significant levels of residual uncertainty into typical
experimental evaluations.
These causes of variability have wide-reaching implications for
experiment design, and for the construction of test collections.
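To make the notion of a residual concrete, the following is a minimal sketch for RBP rather than INST, since RBP's weights have a simple closed form. The function name `rbp_with_residual`, the use of `None` to mark an unjudged document, and the default `p = 0.8` are illustrative assumptions and are not taken from the paper. The score counts unjudged documents as non-relevant; the residual is the total weight that could still be gained if every unjudged document, including everything below the end of the ranking, turned out to be fully relevant.

```python
def rbp_with_residual(gains, p=0.8):
    """Return (score, residual) for Rank-Biased Precision.

    gains: list of per-rank gains in [0, 1], with None marking an
           unjudged document (an illustrative convention).
    p:     RBP persistence parameter, 0 < p < 1.
    """
    score = 0.0
    residual = 0.0
    weight = 1.0 - p              # RBP weight at rank 1
    for gain in gains:
        if gain is None:
            residual += weight    # unjudged: could be worth up to `weight`
        else:
            score += weight * gain
        weight *= p               # weight at the next rank
    # Everything past the end of the ranking is unseen; the weights
    # of that tail sum to p ** len(gains).
    residual += p ** len(gains)
    return score, residual


score, residual = rbp_with_residual([1, None, 0], p=0.5)
print(score, residual)
```

The true score, had all documents been judged, lies somewhere between `score` and `score + residual`. A short ranking or many unjudged documents therefore gives a wide interval.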
Full text
http://dx.doi.org/10.1145/3052768
Data Resource
The UQV100 collection that is mentioned in this paper is available from
http://dx.doi.org/10.4225/49/5726E597B8376.
Software
An implementation of INST has been prepared by Bevan Koopman and is available from
https://github.com/ielab/inst_eval.
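Independently of that implementation, the following is a minimal sketch of how a goal-sensitive, adaptive score of this kind might be computed. It assumes the continuation probability C(i) = ((i + T + T_i - 1) / (i + T + T_i))^2, where T_i is T minus the gain accumulated through rank i; that formula should be checked against the full text before relying on it. The function name `inst_score`, the `horizon` truncation depth, and the treatment of ranks beyond the supplied list as non-relevant are illustrative assumptions, not details of the paper or of the linked software.

```python
def inst_score(gains, T, horizon=10000):
    """Approximate an INST-style score for one ranking.

    gains:   per-rank gains in [0, 1]; ranks past the end of the
             list are treated as non-relevant (an assumption).
    T:       number of relevant documents the user hopes to find.
    horizon: depth at which the infinite weight sum is truncated.
    """
    if T < 1:
        raise ValueError("this sketch assumes T >= 1")
    score = 0.0
    total_weight = 0.0
    weight = 1.0        # unnormalized weight of rank 1
    gained = 0.0        # gain accumulated so far
    for i in range(1, horizon + 1):
        gain = gains[i - 1] if i <= len(gains) else 0.0
        score += weight * gain
        total_weight += weight
        gained += gain
        remaining = T - gained            # T_i: unmet expectation
        base = i + T + remaining
        # Probability of continuing from rank i to rank i + 1;
        # it falls as the target T is approached.
        weight *= ((base - 1.0) / base) ** 2
    # Normalize so that the weights sum to one.
    return score / total_weight


print(inst_score([1, 0, 1, 0, 0], T=2))
```

Under these assumptions a larger `T` spreads the weight further down the ranking, so a single relevant document at rank 1 earns a lower score for `T=3` than for `T=1`.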