Similarity Measures for Tracking Information Flow
Don Metzler
Center for Intelligent Information Retrieval,
University of Massachusetts,
Amherst, MA 01003, United States of America.
Yaniv Bernstein
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Bruce Croft
Center for Intelligent Information Retrieval,
University of Massachusetts,
Amherst, MA 01003, United States of America.
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Justin Zobel
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Status
Proc. 2005 ACM CIKM Int. Conf. Information and
Knowledge Management, Bremen, Germany, November 2005,
pages 517-524.
Abstract
Text similarity spans a spectrum, with broad topical similarity near
one extreme and document identity at the other.
Intermediate levels of similarity -- resulting from summarization,
paraphrasing, copying, and stronger forms of topical relevance -- are
useful for applications such as information flow analysis and
question-answering tasks.
In this paper, we explore mechanisms for measuring such intermediate
kinds of similarity, focusing on the task of identifying where a
particular piece of information originated.
We consider both sentence-to-sentence and document-to-document
comparison, and have incorporated the resulting algorithms into
RECAP, a prototype information flow analysis tool.
Our experimental results with RECAP indicate that new
mechanisms such as those we propose are likely to be more appropriate
than existing methods for identifying the intermediate forms of
similarity.
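To make the idea of a similarity spectrum concrete, the following minimal sketch scores sentence pairs with a plain word-overlap (Jaccard) measure. It is a generic baseline chosen for illustration only and is not one of the measures proposed in the paper or implemented in RECAP. The function names and example sentences are assumptions made for this sketch.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_overlap(a, b):
    """Jaccard overlap of the two sentences' word sets, in [0, 1].

    Hypothetical baseline for illustration; not the paper's measure.
    """
    words_a, words_b = set(tokenize(a)), set(tokenize(b))
    if not words_a and not words_b:
        return 1.0  # two empty sentences are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


# Illustrative sentences (invented for this sketch).
source = "The council approved the new budget on Tuesday."
copy = "The council approved the new budget on Tuesday."
paraphrase = "On Tuesday the council approved a new budget."
topical = "City budgets are often debated for months."

for label, candidate in [("copy", copy), ("paraphrase", paraphrase), ("topical", topical)]:
    print(label, round(word_overlap(source, candidate), 3))
```

In this sketch the exact copy scores highest and the paraphrase lands in between. The topically related sentence scores lowest because it shares no exact word forms with the source, which hints at why simple overlap alone may be too coarse for the intermediate forms of similarity discussed above.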
Errata
We inadvertently neglected to cite in our paper the work of
Clough, Gaizauskas, Piao, and Wilks, "Measuring Text Reuse",
in ACL 2002, pages 152-159, and we apologise for the omission.
There is considerable overlap between their METER project and our
approach with RECAP.