Standardization Factors for IR Test Collections

We advocate standardizing information retrieval metric scores, so that retrieval effectiveness scores derived on different topics and collections are more meaningful in themselves and more easily compared.

To standardize the score that a system receives from a metric for a run on a given topic, one must know the mean of the scores achieved by the original experimental systems on that topic, and also the standard deviation of those scores. A standardized z-score is then calculated by subtracting the mean from the raw score and dividing the result by the standard deviation.
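
As a concrete illustration, here is a minimal R sketch of that computation (the function and variable names are ours, for illustration only):

    # Standardize a raw metric score for one topic.
    #   raw_score:            the system's score on the topic
    #   topic_mean, topic_sd: the standardization factors for that topic and metric
    standardize <- function(raw_score, topic_mean, topic_sd) {
      (raw_score - topic_mean) / topic_sd
    }

    # Example: AP of 0.31 on a topic where the contributing systems
    # had mean AP 0.25 and standard deviation 0.12.
    z <- standardize(0.31, 0.25, 0.12)   # 0.5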

A z-score is unbounded both above and below, whereas most IR metrics range from 0 to 1. Additionally, the unbounded nature of z-scores can give outliers an exaggerated influence when scores are aggregated. For these reasons, we suggest mapping the z-score to the range 0 to 1. One function that can be used to do this is the cumulative distribution function of the Normal distribution, which should be provided by any good statistical package. Under R, use the pnorm() function.
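
Continuing the sketch above, the mapping is a single call:

    # Map the unbounded z-score into (0, 1) via the standard Normal CDF.
    standardized_score <- pnorm(z)   # pnorm(0.5) is about 0.69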


Standardization factors are provided for the following metrics:

  1. Average Precision (AP).
  2. Sum of Precisions (SP). This is AP without normalization by the total number of (known) relevant documents.
  3. Discounted Cumulative Gain (DCG). As originally described by Jarvelin and Kekalainen, with discounting base b = 2.
  4. Normalized Discounted Cumulative Gain (nDCG).
  5. Variant Discounted Cumulative Gain (VDCG), b = 2. As described by Burges et al. This discounts by the log of r + 1 rather than the log of the rank r, and so does not need to adjust for r < b (see the sketch following this list).
  6. Normalized Variant Discounted Cumulative Gain (nVDCG).
  7. Precision at 10 (P@10).
  8. Reciprocal Rank (RR).
  9. Rank-Biased Precision, persistence parameter p=0.8 (RBP.8).
  10. Rank-Biased Precision, p=0.95 (RBP.95).
  11. Precision at R (RP).

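Since the difference between the two DCG discount schemes is easy to miss, here is a short, illustrative R sketch of how they differ (binary gains assumed; this is not the code used to generate the factors):

    # Jarvelin & Kekalainen DCG: ranks below the log base b are not discounted.
    dcg <- function(gains, b = 2) {
      r <- seq_along(gains)
      discount <- ifelse(r < b, 1, log(r, base = b))
      sum(gains / discount)
    }

    # Burges et al. VDCG: every rank is discounted by log_b(r + 1).
    vdcg <- function(gains, b = 2) {
      r <- seq_along(gains)
      sum(gains / log(r + 1, base = b))
    }
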
The standardized score of a metric that has been normalized by dividing by the score of an ideal ranking is identical to the standardized score of the unnormalized form of the same metric: the normalizing constant is fixed for a given topic, so it cancels out of the z-score. Therefore sAP == sSP (AP being the normalized form of SP), snDCG == sDCG, and snVDCG == sVDCG. Standardization factors for both the normalized and unnormalized forms of these metric pairs are provided for convenience; they give the same results.
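
A quick R check of this invariance, using made-up numbers:

    sp <- c(4.2, 7.9, 1.3, 5.5)   # hypothetical SP scores of four systems on one topic
    R  <- 12                      # number of known relevant documents for the topic
    ap <- sp / R                  # AP is SP divided by this per-topic constant

    # The z-scores are identical: R scales the mean and the standard deviation equally.
    all.equal(as.numeric(scale(sp)), as.numeric(scale(ap)))   # TRUE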

The files below contain the means and standard deviations of all of the above metrics for each of the specified TREC tracks. All experimental systems contributing to those tracks have been included in generating the standardization factors. Each file contains a matrix in CSV format, with topics as rows, and metrics as columns. The first row lists the metric names, and the first column lists the topic ids. The standardization factors may also be downloaded as a single tar.gz file.
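
As a rough sketch of how the files might be used from R (the file names below are placeholders, and the "AP" column label is assumed to match the header in the downloaded files):

    # Per-topic means and standard deviations, topics as row names.
    means  <- read.csv("means.csv",  row.names = 1)
    stdevs <- read.csv("stdevs.csv", row.names = 1)

    # A hypothetical run: raw AP scores for two topics.
    run <- data.frame(topic = c(301, 302), AP = c(0.31, 0.18))

    # Standardize against the factors for each topic, then map into (0, 1).
    z <- (run$AP - means[as.character(run$topic), "AP"]) /
          stdevs[as.character(run$topic), "AP"]
    run$sAP <- pnorm(z)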

TREC-3 AdHoc Track

Means   Standard deviations

TREC-4 AdHoc Track

Means   Standard deviations

TREC-5 AdHoc Track

Means   Standard deviations

TREC-6 AdHoc Track

Means   Standard deviations

TREC-7 AdHoc Track

Means   Standard deviations

TREC-8 AdHoc Track

Means   Standard deviations

TREC-9 Web Track

Means   Standard deviations

TREC 2001 Web Track

Means   Standard deviations

TREC 2002 Web Track

Means   Standard deviations

TREC 2003 Web Track

Means   Standard deviations

TREC 2003 Robust Track

Means   Standard deviations

TREC 2004 Web Track

Means   Standard deviations

TREC 2004 Robust Track

Means   Standard deviations

TREC 2004 Terabyte Track

Means   Standard deviations

TREC 2005 Robust Track

Means   Standard deviations

TREC 2005 Terabyte Track

Means   Standard deviations

TREC 2006 Terabyte Track AdHoc Task

Means   Standard deviations

If there are any test collections or metrics not listed that you think should be included, then please let us know.


Last modified: Thu Jul 17 12:05:51 EST 2008

Disclaimer: This page, its content and style, are the responsibility of the author and do not necessarily represent the views, policies, or opinions of The University of Melbourne.