Improving Suffix Array Locality for Fast Pattern Matching on Disk
Ranjan Sinha
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Simon J. Puglisi
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Alistair Moffat
Department of Computer Science and Software Engineering,
The University of Melbourne,
Victoria 3010, Australia.
Andrew Turpin
School of Computer Science and Information Technology,
RMIT University,
Victoria 3001, Australia.
Status
Proc. 2008 ACM SIGMOD International Conference on Management of Data,
Vancouver, Canada, June 2008, pages 661-672.
Abstract
The suffix tree (or equivalently, the enhanced suffix array) provides
efficient solutions to many problems involving pattern matching and
pattern discovery in large strings, such as those arising in
computational biology.
Here we address the problem of arranging a suffix array on disk so
that querying is fast in practice.
We show that the combination of a small trie and a suffix array-like
blocked data structure allows queries to be answered as much as three
times faster than the best alternative disk-based suffix array
arrangement.
Construction of our data structure requires only modest processing
time on top of that required to build the suffix tree, and requires
negligible extra memory.
Full text
http://doi.acm.org/10.1145/1376616.1376683
.