Incorporating User Expectations and Behavior into the
Measurement of Search Effectiveness
Alistair Moffat
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.

Peter Bailey
Microsoft Research, Canberra, Australia.

Falk Scholer
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.

Paul Thomas
Microsoft Research, Canberra, Australia.
ACM Trans. Information Systems, 35(3):24:1-24:38, June 2017.
Information retrieval systems aim to help users satisfy information
needs.
We argue that the goal of the person using the system, and the
pattern of behavior that they exhibit as they proceed to attain that
goal, should be incorporated into the methods and techniques used to
evaluate the effectiveness of IR systems, so that the resulting
effectiveness scores have a useful interpretation that corresponds to
the users' search experience.
In particular, we investigate the role of search task complexity, and
show that it has a direct bearing on the number of relevant answer
documents sought by users in response to an information need,
suggesting that useful effectiveness metrics must be goal sensitive.
We further suggest that user behavior while scanning results listings
is affected by the rate at which their goal is being realized, and
hence that appropriate effectiveness metrics must be
adaptive to the presence (or not) of relevant documents in the
ranking.
In response to these two observations, we present a new effectiveness
metric, INST, that has both of the desired properties: INST
employs a parameter T, a direct measure of the user's search goal
that adjusts the top-weightedness of the evaluation score; moreover,
as progress towards the target T is made, the modeled user
behavior is adapted, to reflect the remaining expectations.
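To make the adaptive mechanism concrete, the sketch below computes a
lower-bound INST score in its weighted-precision form, using the
continuation probability C_i = ((i + T + T_i - 1) / (i + T + T_i))^2
from the paper, where T_i is the portion of the target T still
outstanding after rank i. The function name, the truncation horizon,
the treatment of unjudged ranks as non-relevant, and the single-pass
residual estimate are illustrative assumptions, not the authors'
reference implementation (the paper derives residuals as the gap
between lower- and upper-bound scores).

```python
# Sketch of INST as a weighted-precision metric. The continuation
# probability C_i follows the paper; everything else here is an
# illustrative assumption.

def inst_score(gains, T, horizon=100_000):
    """Lower-bound INST score and a simple residual estimate.

    gains   -- per-rank gains in [0, 1] for the judged prefix;
               ranks beyond len(gains) are assumed non-relevant
    T       -- the user's target amount of relevance (the paper's T)
    horizon -- depth at which the convergent weight series is truncated
    """
    weights = []   # unnormalized weight of rank i: product of C_j, j < i
    w = 1.0
    found = 0.0    # relevance accumulated so far
    for i in range(1, horizon + 1):
        weights.append(w)
        if i <= len(gains):
            found += gains[i - 1]
        T_i = T - found                                 # remaining target
        C_i = ((i + T + T_i - 1) / (i + T + T_i)) ** 2  # continuation prob.
        w *= C_i                                        # weight of rank i+1
    norm = sum(weights)
    score = sum(wi * gains[i]
                for i, wi in enumerate(weights[:len(gains)])) / norm
    residual = sum(weights[len(gains):]) / norm  # weight past the judgments
    return score, residual
```

For example, inst_score([1, 0, 1], T=2) scores a run judged to depth
three against a target of two relevant documents. Note that each
relevant document found lowers C_i, since the goal is closer to being
met; early relevance therefore both raises the score and sharpens the
top-weightedness, which is the adaptivity the abstract describes.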
INST is experimentally compared to previous effectiveness metrics,
including Average Precision (AP), Normalized Discounted Cumulative
Gain (NDCG), and Rank-Biased Precision (RBP), demonstrating our
claims as to INST's usefulness.
Like RBP, INST is a weighted-precision metric, meaning that
each score can be accompanied by a residual that quantifies
the extent of the score uncertainty caused by unjudged documents.
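As an illustration of how weighted-precision residuals arise, the
sketch below follows the base-plus-residual formulation that Moffat
and Zobel (2008) introduced for RBP: each unjudged rank contributes
its full weight to the residual, as does every rank below the judged
depth. The function name and the encoding of unjudged ranks as None
are our own choices.

```python
def rbp(judgments, p=0.8):
    """RBP base score and residual, per Moffat and Zobel (2008).

    judgments -- gain in [0, 1] at each rank, or None if unjudged
    p         -- RBP persistence parameter
    """
    base = residual = 0.0
    weight = 1 - p                 # weight of rank 1 is (1-p) * p^0
    for g in judgments:
        if g is None:
            residual += weight     # unjudged: score could move by this much
        else:
            base += weight * g
        weight *= p                # geometric decay down the ranking
    residual += weight / (1 - p)   # tail: all ranks below the judged depth
    return base, residual
```

The true score then lies in the interval [base, base + residual];
INST's residuals play the same role, although computing them requires
re-scoring with the unjudged documents assumed relevant.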
As part of our experimentation, we use crowd-sourced data and score
residuals to demonstrate that a wide range of queries arise for even
quite specific information needs, and that these variant queries
introduce significant levels of residual uncertainty into typical
experimental evaluations.
These causes of variability have wide-reaching implications for
experiment design, and for the construction of test collections.
The UQV100 collection that is mentioned in this paper is publicly
available.
An implementation of INST has been prepared by Bevan Koopman.