Effective Construction of Relative Lempel-Ziv Dictionaries
Kewen Liao, Matthias Petri, Alistair Moffat, and Anthony Wirth
Department of Computing and Information Systems,
The University of Melbourne, Victoria 3010, Australia.
Status
Proc. Int. Conf. World-Wide Web, Montreal, Canada, April 2016, pages 807-816.
Abstract
Web crawls generate vast quantities of text, retained and archived by
the search services that initiate them.
To allow storage costs to be minimized, while still providing some
level of random access to the compressed data, efficient and effective
compression techniques are critical.
The Relative Lempel-Ziv (RLZ) scheme provides fast decompression and
retrieval of documents from within large compressed collections, and,
even with a relatively small RAM-resident dictionary, is competitive
with adaptive compression schemes.
To date, the dictionaries required by RLZ compression have been
formed from concatenations of substrings regularly sampled from the
underlying document collection, then pruned in a manner that seeks to
retain only the high-use sections.
In this work, we develop new dictionary design heuristics, based on
effective construction, rather than on pruning; we identify
dictionary construction as a (string) covering problem.
To avoid the complications of string covering algorithms on large
collections, we focus on k-mers and their frequencies.
First, with a reservoir sampler, we efficiently identify the most
common k-mers.
Then, since a collection typically comprises regions of local
similarity, we select in each "epoch" a segment whose k-mers together
achieve, locally, the highest coverage score.
The dictionary is formed from the concatenation of these
epoch-derived segments.
Our selection process is inspired by the greedy approach to the Set
Cover problem.
Compared with the best existing pruning method, CARE, our scheme has
a similar construction time, but achieves better compression
effectiveness.
Over several multi-gigabyte document collections, there are relative
gains of up to 27%.
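
As a concrete illustration of the first step, here is a minimal Python
sketch of reservoir sampling over k-mers. It is not the paper's
implementation: the function name, the parameter values, and the idea
of counting within the sample to estimate the overall most common
k-mers are illustrative assumptions.

    import random
    from collections import Counter

    def frequent_kmers(text, k=16, reservoir_size=100_000, top=1000, seed=1):
        """Reservoir-sample k-mers from text (Algorithm R), then return
        the `top` most frequent k-mers within the sample as an estimate
        of the most common k-mers in the whole collection."""
        rng = random.Random(seed)
        reservoir = []
        for i in range(len(text) - k + 1):
            kmer = text[i:i + k]
            if len(reservoir) < reservoir_size:
                reservoir.append(kmer)
            else:
                # Keep the sample uniform over all k-mers seen so far:
                # replace a random slot with probability
                # reservoir_size / (i + 1).
                j = rng.randint(0, i)
                if j < reservoir_size:
                    reservoir[j] = kmer
        counts = Counter(reservoir)
        return {kmer for kmer, _ in counts.most_common(top)}

In practice the collection would be read in blocks rather than held in
memory as a single string; the sampling logic is unchanged.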
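
The epoch-based selection can be sketched in the same spirit. Fixed
epoch and segment lengths are simplifying assumptions made here; the
scoring follows the greedy Set Cover heuristic mentioned above, with
each epoch contributing the segment whose k-mers add the most
not-yet-covered frequent k-mers.

    def build_dictionary(text, frequent, k=16, epoch_len=1 << 20,
                         seg_len=1 << 12):
        """Greedy, epoch-local segment selection: within each epoch,
        pick the segment covering the most frequent k-mers that the
        dictionary does not yet cover, then concatenate the winners."""
        covered = set()
        pieces = []
        for e in range(0, len(text), epoch_len):
            epoch = text[e:e + epoch_len]
            best_seg, best_gain = None, -1
            # Candidate segments at segment-aligned offsets in the epoch.
            for s in range(0, max(1, len(epoch) - seg_len + 1), seg_len):
                seg = epoch[s:s + seg_len]
                kmers = {seg[i:i + k] for i in range(len(seg) - k + 1)}
                gain = len((kmers & frequent) - covered)
                if gain > best_gain:
                    best_seg, best_gain = seg, gain
            if best_seg is not None:
                covered |= {best_seg[i:i + k]
                            for i in range(len(best_seg) - k + 1)} & frequent
                pieces.append(best_seg)
        return "".join(pieces)

A call such as build_dictionary(text, frequent_kmers(text)) then yields
an RLZ-style dictionary string; the actual system targets
multi-gigabyte collections and so would stream the input rather than
slice an in-memory string.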
Full text
http://doi.acm.org/10.1145/2872427.2883042
Errata
Looks like we committed the avoidable carelessness of having two people
use different BibTeX tags for the same paper, and of not noticing when
we should have; refs [4] and [6] are the same.