Can Deep Effectiveness Metrics Be Evaluated Using Shallow Judgment Pools?


Xiaolu Lu
School of Computer Science and Information Technology, RMIT University, Victoria 3001, Australia.

Alistair Moffat
School of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Shane Culpepper
School of Computer Science and Information Technology, RMIT University, Victoria 3001, Australia.


Status

Proc. 40th Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, Tokyo, Japan, August 2017, pages 35-44.

Abstract

Increasing test collection sizes and limited judgment budgets create measurement challenges for IR batch evaluations, challenges that are greater when using deep effectiveness metrics than when using shallow metrics, because of the increased likelihood that unjudged documents will be encountered. Here we study the problem of metric score adjustment, with the goal of accurately estimating system performance when using deep metrics and limited judgment sets, assuming that dynamic score adjustment is required per topic due to the variability in the number of relevant documents. We seek to induce system orderings that are as close as possible to the orderings that would arise if full judgments were available.

Starting with depth-based pooling, and no prior knowledge of sampling probabilities, the first stage of our two-stage process computes a background gain for each document based on rank-level statistics. The second stage then accounts for the distributional variance of relevant documents. We also exploit the frequency statistics of pooled relevant documents to set a threshold that dynamically determines the set of topics to be adjusted. Taken together, our results show that: (i) better score estimates can be achieved than in previous work; (ii) by setting a global threshold, we are able to adapt our methods to different collections; and (iii) the proposed estimation methods reliably approximate the system orderings achieved when many more relevance judgments are available. We also consider pools generated by a two-strata sampling approach.


Full text

http://dx.doi.org/10.1145/3077136.3080793