Ellen M. Voorhees
National Institute of Standards and Technology,
Gaithersburg, MD 20899 USA
The test collections developed in the TREC workshops have become the collections of choice in the retrieval research community. To verify their reliability, NIST investigated the effect changes in the relevance assessments have on the evaluation of retrieval results. Very high correlations were found among the rankings of systems produced using different relevance judgment sets. The high correlations indicate that the comparative evaluation of retrieval performance is stable despite substantial differences in relevance judgments, and thus reaffirm the use of the TREC collections as laboratory tools.
SIGIR'98