Improving Suffix Array Locality for Fast Pattern Matching on Disk


Ranjan Sinha
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Simon J. Puglisi
School of Computer Science and Information Technology, RMIT University, Victoria 3001, Australia.

Alistair Moffat
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia.

Andrew Turpin
School of Computer Science and Information Technology, RMIT University, Victoria 3001, Australia.


Status

Proc. 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, Canada, June 2008, pages 661-672.

Abstract

The suffix tree (or equivalently, the enhanced suffix array) provides efficient solutions to many problems involving pattern matching and pattern discovery in large strings, such as those arising in computational biology. Here we address the problem of arranging a suffix array on disk so that querying is fast in practice. We show that the combination of a small trie and a suffix array-like blocked data structure allows queries to be answered as much as three times faster than the best alternative disk-based suffix array arrangement. Construction of our data structure requires only modest processing time on top of that required to build the suffix tree, and requires negligible extra memory.

Full text

http://doi.acm.org/10.1145/1376616.1376683 .