Adding Compression and Blended Search to a
Compact Two-Level Suffix Array
Simon Gog
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.
Alistair Moffat
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. Symp. String Processing and Information Retrieval,
Jerusalem, October 2013, pages 141-152, LNCS volume 8214.
Abstract
The suffix array is an efficient in-memory data structure for pattern
search; and two-level variants also exist that are suited to external
searching and can handle strings larger than the available memory.
Assuming the latter situation, we introduce a factor-based mechanism
for compressing the text string that integrates seamlessly with the
in-memory index search structure, rather than requiring a separate
dictionary.
We also introduce a mixture of indexed and sequential pattern search
in a trade-off that allows further space savings.
Experiments on a 4 GB computer with 62.5 GB of English text show that
a two-level arrangement is possible in which around 2.5% of the text
size is required as an index and for which the disk-resident
components, including the text itself, occupy less than twice the
space of the original text; and with which count queries can
be carried out using two disk accesses and less than two milliseconds
of CPU time.
Full text
http://dx.doi.org/10.1007/978-3-319-02432-5_18