Re-Store: A System for Compressing, Browsing, and Searching Large Documents


Alistair Moffat
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Raymond Wan
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.


Status

Proc. International Symposium on String Processing and Information Retrieval, Laguna de San Rafael, Chile, November 2001, 162-174.

Abstract

We describe a software system for managing text files of up to several hundred megabytes that combines a number of useful facilities. First, the text is stored compressed using a variant of the Re-Pair mechanism described by Larsson and Moffat, with space savings comparable to those obtained by other widely used general-purpose compression systems. Second, we provide, as a byproduct of the compression process, a phrase-based browsing tool that allows users to explore the contents of the source text in a natural and useful manner. And third, once a set of desired phrases has been determined through the use of the browsing tool, the compressed text can be searched to determine locations at which those phrases appear, without decompressing the whole of the stored text, and without use of an additional index. That is, we show how the Re-Pair compression regime can be extended to allow phrase-based browsing and fast interactive searching.
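
For readers unfamiliar with Re-Pair, the following is a minimal sketch of the basic idea: repeatedly replace the most frequent adjacent pair of symbols with a new symbol until no pair occurs twice. The resulting rule set doubles as a phrase dictionary that could be browsed. This is a naive quadratic-time illustration and not the implementation described in the paper; the function names (repair_compress, repair_decompress, list_phrases) and the tuple-based symbol representation are assumptions made for the example.

```python
from collections import Counter


def repair_compress(text):
    """Naive Re-Pair sketch: returns (reduced sequence, rules)."""
    seq = list(text)
    rules = {}      # new symbol -> (left symbol, right symbol)
    next_id = 0
    while True:
        # Count adjacent pairs (overlaps such as "aaa" are counted naively).
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        # Hypothetical representation: new symbols are ("R", n) tuples,
        # so they can never collide with single-character strings.
        sym = ("R", next_id)
        next_id += 1
        rules[sym] = pair
        # Replace non-overlapping occurrences, scanning left to right.
        out = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(sym, rules):
    """Recursively expand a symbol back into the text it represents."""
    if sym not in rules:
        return sym
    left, right = rules[sym]
    return expand(left, rules) + expand(right, rules)


def repair_decompress(seq, rules):
    """Rebuild the original text from the reduced sequence and rules."""
    return "".join(expand(s, rules) for s in seq)


def list_phrases(rules):
    """Expanded form of every rule: a crude stand-in for a phrase browser."""
    return sorted({expand(sym, rules) for sym in rules})


if __name__ == "__main__":
    sample = "the cat sat on the mat, the cat sat on the hat"
    seq, rules = repair_compress(sample)
    print(len(sample), "characters reduced to", len(seq), "symbols")
    print(list_phrases(rules))
```

Each rule expands to a substring that occurs in the source text, which is what makes the rule set usable as a browsable phrase list in this sketch.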