We advocate standardizing information retrieval metric scores so that retrieval effectiveness scores derived on different topics and collections are more meaningful in themselves and more easily compared.
To standardize the score that a metric gives a run on a particular topic, one must know the mean of the scores achieved on that topic by the original experimental systems, and also the standard deviation of those scores. The standardized z-score is then calculated by subtracting the mean from the raw score and dividing the result by the standard deviation.
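In symbols, if x is a run's raw score on topic t, and mu_t and sigma_t are (as described above) the mean and standard deviation of the original systems' scores on that topic, the standardized score is:

    z = \frac{x - \mu_t}{\sigma_t}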
A z-score is unbounded both above and below, whereas most IR metrics range from 0 to 1. Additionally, the unbounded nature of z-scores can give outliers an exaggerated influence when scores are aggregated. For these reasons, we suggest mapping the z-score to the range 0 to 1. One function that can be used to do this is the cumulative distribution function of the Normal distribution, which most good statistical packages provide. Under R, use the pnorm() function.
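As a minimal sketch of both steps in R (the raw score, topic mean, and standard deviation below are illustrative values only, not taken from any of the standardization files):

    # Illustrative raw score for one run on one topic, together with the
    # topic's mean and standard deviation over the original experimental
    # systems (hypothetical values).
    raw.score  <- 0.32
    topic.mean <- 0.25
    topic.sd   <- 0.10

    # Standardize: subtract the topic mean, divide by the topic standard deviation.
    z <- (raw.score - topic.mean) / topic.sd

    # Map the unbounded z-score into the range 0 to 1 with the Normal CDF.
    standardized <- pnorm(z)
    standardized    # approximately 0.758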
Standardization factors are provided for the following metrics:
The standardized score of a metric that has been normalized by dividing by the score of an ideal ranking is identical to the standardized score of the unnormalized form of the same metric. Therefore sAP == snSP == sSP, sDCG == snDCG, and sVDCG == snVDCG. Standardization factors for both the normalized and unnormalized forms of these metric pairs are provided for convenience; they should give the same results.
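To see why, note that the ideal-ranking score is the same constant c for every system on a given topic, so dividing by it rescales the topic mean and standard deviation by that same factor and leaves the z-score unchanged:

    \frac{x/c - \mu/c}{\sigma/c} = \frac{x - \mu}{\sigma}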
The files below contain the means and standard deviations of all of the above metrics for each of the specified TREC tracks. All experimental systems contributing to those tracks have been included in generating the standardization factors. Each file contains a matrix in CSV format, with topics as rows, and metrics as columns. The first row lists the metric names, and the first column lists the topic ids. The standardization factors may also be downloaded as a single tar.gz file.
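For example, a run's raw per-topic AP scores could be standardized against these factors roughly as follows in R. The file names, the split into separate mean and standard-deviation files, the metric column name "AP", and the topic ids and scores are all hypothetical; adapt them to the actual files above.

    # Hypothetical file names; each file is a topics-by-metrics CSV matrix,
    # with metric names in the first row and topic ids in the first column.
    means <- read.csv("track-means.csv", row.names = 1)
    sds   <- read.csv("track-sds.csv",   row.names = 1)

    # Raw AP for one run, named by topic id (hypothetical values);
    # indexing by topic id keeps the run aligned with the factor files.
    run.ap <- c("401" = 0.31, "402" = 0.07, "403" = 0.52)
    topics <- names(run.ap)

    # Standardize per topic, then map into the range 0 to 1 via the Normal CDF.
    z   <- (run.ap - means[topics, "AP"]) / sds[topics, "AP"]
    sap <- pnorm(z)

    # Aggregate over topics in the usual way, e.g. by the arithmetic mean.
    mean(sap)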
If there are any test collections or metrics not listed that you think should be included, then please let us know.
Disclaimer: This page, its content and style, are the responsibility of the author and do not necessarily represent the views, policies, or opinions of The University of Melbourne.