Lossy Compression of Quality Scores in Genomic Data
Rodrigo Cánovas
Department of Computing and Information Systems
The University of Melbourne,
Victoria 3010, Australia.
Alistair Moffat
Department of Computing and Information Systems
The University of Melbourne,
Victoria 3010, Australia.
Andrew Turpin
Department of Computing and Information Systems
The University of Melbourne,
Victoria 3010, Australia.
Status
Bioinformatics, 30(15):2130-2136, 2014.
Abstract
Motivation: Next-generation sequencing technologies are revolutionizing medicine.
Data from sequencing technologies is typically represented as a
string of bases, an associated sequence of per-base quality scores,
and other metadata, and in aggregate can require a very large amount
of space.
The quality scores show how accurate the bases are with respect to
the sequencing process, that is, how confident the sequencer is of
having called them correctly, and are the largest component in data
sets in which they are retained.
Previous research has examined how to store sequences of bases
effectively; here we add to that knowledge by examining methods for
compressing quality scores.
The quality values originate in a continuous domain, and so once a
fidelity criterion is specified, there is flexibility in the way
these values are represented, allowing lossy compression of the
quality score data.
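As an illustration of the idea only, and not of the techniques introduced in the paper, the following minimal sketch assumes the standard Phred convention (Q = -10 log10 p, stored in FASTQ with an ASCII offset of 33) and applies a simple uniform binning of the scores. The function names and the choice of maximum absolute error as the fidelity measure are assumptions made for this example.

```python
def decode_fastq_quality(qual, offset=33):
    """Convert a FASTQ quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual]


def phred_to_error_prob(q):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def quantize(scores, bin_width):
    """Hypothetical lossy step: replace each score by the midpoint of its bin.

    Fewer distinct symbols remain, which a downstream entropy coder can exploit.
    """
    return [(q // bin_width) * bin_width + bin_width // 2 for q in scores]


def max_abs_error(original, lossy):
    """One possible fidelity measure: the largest per-score deviation."""
    return max(abs(a - b) for a, b in zip(original, lossy))


if __name__ == "__main__":
    scores = decode_fastq_quality("II5+!")
    lossy = quantize(scores, bin_width=4)
    print("original:", scores)
    print("lossy:   ", lossy)
    print("max abs error:", max_abs_error(scores, lossy))
```

A wider bin reduces the number of distinct values to be stored but increases the worst-case deviation from the original scores, which is the kind of trade-off between space and fidelity that the experiments quantify.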
Results: We present existing compression options for
quality score data, and then introduce two new lossy techniques.
Experiments measuring the trade-off between compression ratio and
information loss are reported, including quantifying the effect of
lossy representations on a downstream application that carries out
Single Nucleotide Polymorphism and insert/deletion detection.
The new methods are demonstrably superior to other techniques
when assessed against the spectrum of possible trade-offs between
storage required and fidelity of representation.
Full text
http://dx.doi.org/10.1093/bioinformatics/btu183