Adding Compression and Blended Search to a
Compact Two-Level Suffix Array
Simon Gog
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.
Alistair Moffat
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.
Matthias Petri
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. Symp. String Processing and Information Retrieval,
Ouro Preto, Brazil, October 2014, pages 1-12, LNCS volume 8799.
Abstract
We consider the problem of pattern-search in compressed text in a
context in which: (a) the text is stored as a sequence of factors
against a static phrase-book; (b) decoding of factors is from
right-to-left; and (c) extraction of each symbol in each factor
requires Theta(log sigma) time, where sigma is the size of the
original alphabet.
To determine possible alignments given information about decoded
characters we introduce two Boyer-Moore-like searching mechanisms,
including one that makes use of a suffix array constructed over the
pattern.
The new mechanisms decode fewer than half the symbols that are
required by a sequential left-to-right search such as the Knuth-Morris-Pratt
approach, a saving that translates directly into improved execution
time.
Experiments with a two-level suffix array index structure for
4 GB of English text demonstrate the usefulness of the new
techniques.
Full text
http://dx.doi.org/10.1007/978-3-319-11918-2_1