Adding Compression and Blended Search to a Compact Two-Level Suffix Array


Simon Gog
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Alistair Moffat
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.


Status

Proc. Symp. String Processing and Information Retrieval, Jerusalem, October 2013, pages 141-152, LNCS volume 8214.

Abstract

The suffix array is an efficient in-memory data structure for pattern search; and two-level variants also exist that are suited to external searching and can handle strings larger than the available memory. Assuming the latter situation, we introduce a factor-based mechanism for compressing the text string that integrates seamlessly with the in-memory index search structure, rather than requiring a separate dictionary. We also introduce a mixture of indexed and sequential pattern search in a trade-off that allows further space savings. Experiments on a 4 GB computer with 62.5 GB of English text show that a two-level arrangement is possible in which around 2.5% of the text size is required as an index and for which the disk-resident components, including the text itself, occupy less than twice the space of the original text; and with which count queries can be carried out using two disk accesses and less than two milliseconds of CPU time.

Full text

http://dx.doi.org/10.1007/978-3-319-02432-5_18