Similarity Measures for Tracking Information Flow


Don Metzler
Center for Intelligent Information Retrieval, University of Massachusetts, Amherst, MA 01003, United States of America.

Yaniv Bernstein
School of Computer Science and Information Technology, RMIT University, Victoria 3001, Australia.

Bruce Croft
Center for Intelligent Information Retrieval, University of Massachusetts, Amherst, MA 01003, United States of America.

Alistair Moffat
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Justin Zobel
School of Computer Science and Information Technology, RMIT University, Victoria 3001, Australia.


Status

Proc. 2005 ACM CIKM Int. Conf. Information and Knowledge Management, Bremen, Germany, November 2005, pages 517-524.

Abstract

Text similarity spans a spectrum, with broad topical similarity near one extreme and document identity at the other. Intermediate levels of similarity -- resulting from summarization, paraphrasing, copying, and stronger forms of topical relevance -- are useful for applications such as information flow analysis and question-answering tasks. In this paper, we explore mechanisms for measuring such intermediate kinds of similarity, focusing on the task of identifying where a particular piece of information originated. We consider both sentence-to-sentence and document-to-document comparison, and have incorporated the resulting algorithms into RECAP, a prototype information flow analysis tool. Our experimental results with RECAP indicate that new mechanisms such as those we propose are likely to be more appropriate than existing methods for identifying the intermediate forms of similarity.

Errata

We inadvertently neglected to cite in our paper the work of Clough, Gaizauskas, Piao, and Wilks, "Measuring Text Reuse", in ACL 2002, pages 152-159, for which we apologise. There is considerable overlap between their METER project and our approach with RECAP.