Collection-Independent Document-Centric Impacts

Vo Ngoc Anh
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Alistair Moffat
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Status

Proc. Australian Document Computing Symposium, Melbourne, December 13, 2004, pages 25-32.

Abstract

An information retrieval system employs a similarity heuristic to estimate the probability that documents and queries match each other. The heuristic is usually formulated in the context of a collection, so that the relationship between each document and the collection that contains it affects the scoring used to provide the ranked set of answers in response to a query. In this paper we continue our study of document-centric similarity measures, but seek to eliminate the reliance on collection statistics in setting the document-related components of the measure. Doing so has a direct implementation benefit: impact-sorted inverted indexes can be built with just a single parse of the source text.
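
To illustrate why collection independence permits single-pass index construction, the following is a minimal sketch and not the method of the paper. It assumes that each term's integer impact is derived solely from the term's frequency rank within its own document, using uniform rank buckets. The names document_impacts and build_index, the whitespace tokenizer, the alphabetical tie-break, and the number of impact levels K are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict

K = 8  # hypothetical number of integer impact levels (an assumption)


def document_impacts(text, k=K):
    """Map each distinct term of one document to an integer impact in 1..k.

    Only this document's own term frequencies are used, so no collection
    statistics (such as document frequencies) are needed.
    """
    counts = Counter(text.lower().split())
    # Rank terms by decreasing within-document frequency; ties are broken
    # alphabetically (an arbitrary choice made for this sketch).
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    n = len(ranked)
    # Uniform bucketing of ranks: the top-ranked term receives impact k,
    # and the lowest-ranked term receives an impact of at least 1.
    return {term: k - (rank * k) // n for rank, term in enumerate(ranked)}


def build_index(docs, k=K):
    """Build an impact-sorted inverted index in one pass over the documents."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        # Impacts are final as soon as the document has been parsed, so the
        # postings can be appended immediately; no second parse is required.
        for term, impact in document_impacts(text, k).items():
            index[term][impact].append(doc_id)
    # For each term, order the blocks of document ids by decreasing impact.
    return {
        term: sorted(blocks.items(), reverse=True)
        for term, blocks in index.items()
    }
```

For example, build_index(["the cat sat on the mat the end", "cat cat dog"], k=4) processes each document exactly once. Under a collection-dependent weighting, by contrast, the impacts could only be fixed after a first parse had gathered the collection statistics.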