Needles and Haystacks: A Search Engine for Personal Information Collections


Owen de Kretser
Department of Computer Science and Software Engineering, The University of Melbourne, Parkville 3052, Australia.

Alistair Moffat
Department of Computer Science and Software Engineering, The University of Melbourne, Parkville 3052, Australia.


Status

Proc. 23rd Australasian Computer Science Conference, Canberra, Australia, February 2000, 58-65.

Abstract

Information retrieval systems can be partitioned into two main classes: large-scale systems that make use of an inverted index or some other auxiliary data structure, intended for massive volumes of data; and the small-scale systems based upon sequential pattern matching that most computer users employ when hunting for missing email and news items. In this paper we describe a hybrid approach that offers the ranked queries and similarity matching of a genuine information retrieval system, but does so without any need for an index to be precomputed. This software tool, which we call seft, offers performance that in a retrieval effectiveness sense matches conventional information retrieval systems, and in a resource efficiency sense, while considerably slower than grep-like tools, is fast enough to be useful on hundreds of megabytes of text.

Software

The seft software is available here.