Re-Store: A System for Compressing, Browsing, and Searching Large Documents
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Raymond Wan
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. International Symposium on String Processing
and Information Retrieval,
Laguna de San Rafael, Chile, November 2001, 162-174.
Abstract
We describe a software system for managing text files of up to
several hundred megabytes that combines three useful
facilities.
First, the text is stored compressed using a variant of the Re-Pair
mechanism described by Larsson and Moffat, with space savings
comparable to those obtained by other widely used general-purpose
compression systems.
Second, we provide, as a byproduct of the compression process, a
phrase-based browsing tool that allows users to explore the contents
of the source text in a natural and useful manner.
Third, once a set of desired phrases has been determined through
the use of the browsing tool, the compressed text can be searched to
determine locations at which those phrases appear, without
decompressing the whole of the stored text, and without use of an
additional index.
That is, we show how the Re-Pair compression regime can be extended
to allow phrase-based browsing and fast interactive searching.
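
To make the idea concrete, the following is a minimal sketch of the basic Re-Pair scheme: the most frequent adjacent pair of symbols is repeatedly replaced by a new symbol, and the resulting rules double as a phrase list. It is not the variant used in Re-Store and not the authors' implementation. It recounts pairs on every pass, so it is only suitable for small inputs. The function names (repair_compress, expand, locate) are hypothetical, and locate reports only those occurrences that the parse captured as the given phrase symbol, not every occurrence of the corresponding string.

```python
from collections import Counter


def count_pairs(seq):
    """Count adjacent pairs, ignoring self-overlaps such as the middle of 'aaa'."""
    counts = Counter()
    last_pos = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if pair[0] == pair[1] and last_pos.get(pair) == i - 1:
            continue  # overlaps the previously counted occurrence
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def replace_pair(seq, pair, symbol):
    """Replace non-overlapping occurrences of pair, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair_compress(text):
    """Return (sequence, rules). Terminals are 1-char strings, phrases are ints."""
    seq = list(text)
    rules = {}  # phrase id -> (left symbol, right symbol)
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, count = counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so nothing more to gain
        symbol = len(rules)
        rules[symbol] = pair
        seq = replace_pair(seq, pair, symbol)
    return seq, rules


def expand(symbol, rules):
    """Recover the text of one symbol (a terminal or a phrase)."""
    if isinstance(symbol, str):
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def locate(seq, rules, target):
    """Text offsets at which phrase `target` appears in the parse,
    descending only into phrases that contain it."""
    length, contains = {}, {}

    def size(s):
        return 1 if isinstance(s, str) else length[s]

    for sym in sorted(rules):  # rules only refer to earlier ids
        left, right = rules[sym]
        length[sym] = size(left) + size(right)
        contains[sym] = sym == target or any(
            not isinstance(c, str) and contains[c] for c in (left, right)
        )

    hits = []

    def walk(s, offset):
        if isinstance(s, str) or not contains[s]:
            return  # skipped without expanding
        if s == target:
            hits.append(offset)
            return
        left, right = rules[s]
        walk(left, offset)
        walk(right, offset + size(left))

    offset = 0
    for s in seq:
        walk(s, offset)
        offset += size(s)
    return hits
```

For browsing, a phrase list can be produced with {p: expand(p, rules) for p in rules}. For searching, locate(seq, rules, p) visits only the parts of the compressed sequence that contain phrase p, which illustrates in a simplified form how locations can be found without decompressing the whole text or consulting a separate index.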