Adding Compression and Blended Search to a Compact Two-Level Suffix Array

Simon Gog
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Alistair Moffat
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Matthias Petri
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Status

Proc. Symp. String Processing and Information Retrieval, Ouro Preto, Brazil, October 2014, pages 1-12, LNCS volume 8799.

Abstract

We consider the problem of pattern-search in compressed text in a context in which: (a) the text is stored as a sequence of factors against a static phrase-book; (b) decoding of factors is from right-to-left; and (c) extraction of each symbol in each factor requires Theta(log sigma) time, where sigma is the size of the original alphabet. To determine possible alignments given information about decoded characters we introduce two Boyer-Moore-like searching mechanisms, including one that makes use of a suffix array constructed over the pattern. The new mechanisms decode fewer than half the symbols that are required by a sequential left-to-right search such as the Knuth-Morris-Pratt approach, a saving that translates directly into improved execution time. Experiments with a two-level suffix array index structure for 4 GB of English text demonstrate the usefulness of the new techniques.

Full text

http://dx.doi.org/10.1007/978-3-319-11918-2_1