Efficient Indexing of Repeated n-Grams
Samuel Huston
Department of Computer Science,
University of Massachussetts,
Amherst, USA.
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
W. Bruce Croft
Department of Computer Science,
University of Massachussetts,
Amherst, USA.
Status
Proc. 4th ACM International Conference on Web Search
and Data Mining, Hong Kong, February 2011, pages 127-136.
Abstract
The identification of repeated n-gram phrases in text has many
practical applications, including authorship attribution, text reuse
identification, and plagiarism detection.
We consider methods for finding the repeated n-grams in text
corpora, with emphasis on techniques that can be effectively scaled
across a cluster of processors to handle very large amounts of text.
We compare our proposed method to existing techniques using the
1.5 TB TREC ClueWeb-B text collection, using both single-processor
and multi-processor approaches.
The experiments show that our method offers an important tradeoff
between speed and temporary storage space, and provides an
alternative to previous approaches that scales almost linearly in the
length of the sequence, is largely independent of n, and provides a
uniform workload balance across the set of available processors.
Full Text
http://doi.acm.org/10.1145/1935826.1935857