Reducing Space Requirements for Disk Resident Suffix Arrays
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Simon J. Puglisi
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Ranjan Sinha
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. Conf. Database Systems for Advanced Applications,
Brisbane, Australia, April 2009, pages 730-744, LNCS volume 5463.
Abstract
Suffix trees and suffix arrays are important data structures for
string processing, providing efficient solutions for many
applications involving pattern matching.
Recent work by Sinha et al. (SIGMOD 2008) addressed the problem of
arranging a suffix array on disk so that querying is fast, and showed
that the combination of a small trie and a suffix array-like blocked
data structure allows queries to be answered many times faster than
alternative disk-based suffix trees.
A drawback of their LOF-SA structure, and common to all current disk
resident suffix tree/array approaches, is that the space requirement
of the data structure, though on disk, is large relative to the text
-- for the LOF-SA, 13n bytes including the underlying n byte
text.
In this paper we explore techniques for reducing the space required
by the LOF-SA.
Experiments show these methods cut the data structure to nearly half
its original size, without, for large strings that necessitate
on-disk structures, any impact on search times.
Full text
http://dx.doi.org/10.1007/978-3-642-00887-0_63
.