Fast File Search Using Text Compression
Andrew Turpin
Department of Computer Science,
The University of Melbourne,
Parkville 3052, Australia.
Alistair Moffat
Department of Computer Science,
The University of Melbourne,
Parkville 3052, Australia.
Status
Proc. 20th Australasian Computer Science Conference,
Sydney, February 1997, pages 1-8.
Abstract
Most people who have used a computer in any way will be familiar with
the ``message in a diskstack'' problem: somewhere, in a directory
containing tens of thousands of mail messages stored in a hundred or
more folders is the crucial information required right now.
The traditional solution to this problem is to use grep on entire archives,
employing
as a search target some keyword or other string thought to be in the
desired mail item.
There are, however, two problems with this approach.
First, the need to be able to make use of grep precludes compression
of the various mail folders, contributing to the also
familiar ``device full'' problem.
Second, for the volume of mail/news that most users now maintain
exhaustive searching can be tedious.
The solution we propose is that the various files be maintained
compressed, but in a form amenable to fast searching.
We describe pgrep, a combined compression/searching tool for text
files.
Using pgrep, a directory of files can be searched for keywords in
time comparable to that required by grep, and stored in less than
half of the space required by the uncompressed files.
We then combine both exhaustive searching
and a coarse-grained index to achieve compression and searching in a
single package that is more usable than a full-text
indexing system and less resource intensive than storing a
directory of uncompressed files.
Keywords
Huffman code,
prefix-free code,
minimum-redundancy code,
length-limited code,
string searching,
grep,
information retrieval,
text compression.