Self-Indexing Inverted Files for Fast Text Retrieval


Alistair Moffat
Department of Computer Science, The University of Melbourne, Parkville 3052, Australia.

Justin Zobel
Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia.


Status

ACM Trans. Information Systems, 14(4):349-379, October 1996.

Abstract

Query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time required. Here we show that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the compressed inverted file, which itself occupies less than 10% of the indexed text, yet can reduce processing time for Boolean queries of 5-10 terms to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
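
To illustrate the idea of an internal index inside an inverted list, here is a minimal Python sketch. It is not the paper's implementation: the class name SkippedList, the function intersect, the block size, and the use of plain integer gaps in place of a real compression code are all assumptions made for illustration. Each list is cut into blocks of document-number gaps, and a small index records the first document number of each block, so that a membership test decodes one block instead of the whole list.

```python
from bisect import bisect_right
from itertools import accumulate


class SkippedList:
    """Hypothetical inverted list: d-gap blocks plus a small internal index."""

    def __init__(self, doc_ids, block_size=4):
        # doc_ids must be sorted in increasing order.
        self.firsts = []   # internal index: first doc id of each block
        self.blocks = []   # gaps between successive ids within each block
        for i in range(0, len(doc_ids), block_size):
            chunk = doc_ids[i:i + block_size]
            self.firsts.append(chunk[0])
            self.blocks.append([b - a for a, b in zip(chunk, chunk[1:])])
        self.decoded = 0   # number of blocks decoded so far (illustration only)

    def contains(self, target):
        # Use the internal index to find the only block that could hold target.
        b = bisect_right(self.firsts, target) - 1
        if b < 0:
            return False
        # Decode just that block by summing its gaps from the block's first id.
        self.decoded += 1
        return target in accumulate(self.blocks[b], initial=self.firsts[b])


def intersect(candidates, long_list):
    """Conjunctive step: keep candidates that also occur in long_list."""
    return [d for d in candidates if long_list.contains(d)]


long_list = SkippedList(list(range(0, 100, 3)), block_size=4)
print(intersect([3, 50, 51, 99], long_list))
print(long_list.decoded, "of", len(long_list.blocks), "blocks decoded")
```

In this sketch the candidates would come from the shortest list in a conjunctive query, and each longer list is probed through its index. A real system would store the gaps and the index entries in compressed form, which this sketch leaves out.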