Fast File Search Using Text Compression

Andrew Turpin
Department of Computer Science, The University of Melbourne, Parkville 3052, Australia.

Alistair Moffat
Department of Computer Science, The University of Melbourne, Parkville 3052, Australia.

Status

Proc. 20th Australasian Computer Science Conference, Sydney, February 1997, pages 1-8.

Abstract

Most people who have used a computer in any way will be familiar with the ``message in a diskstack'' problem: somewhere, in a directory containing tens of thousands of mail messages stored in a hundred or more folders is the crucial information required right now. The traditional solution to this problem is to use grep on entire archives, employing as a search target some keyword or other string thought to be in the desired mail item. There are, however, two problems with this approach. First, the need to be able to make use of grep precludes compression of the various mail folders, contributing to the also familiar ``device full'' problem. Second, for the volume of mail/news that most users now maintain exhaustive searching can be tedious. The solution we propose is that the various files be maintained compressed, but in a form amenable to fast searching. We describe pgrep, a combined compression/searching tool for text files. Using pgrep, a directory of files can be searched for keywords in time comparable to that required by grep, and stored in less than half of the space required by the uncompressed files. We then combine both exhaustive searching and a coarse-grained index to achieve compression and searching in a single package that is more usable than a full-text indexing system and less resource intensive than storing a directory of uncompressed files.

Keywords

Huffman code, prefix-free code, minimum-redundancy code, length-limited code, string searching, grep, information retrieval, text compression.