Lossy Compression of Quality Scores in Genomic Data


Rodrigo Cánovas
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Alistair Moffat
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Andrew Turpin
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.


Status

Bioinformatics, 30(15):2130-2136, 2014.

Abstract

Motivation: Next-generation sequencing technologies are revolutionizing medicine. Data from sequencing technologies is typically represented as a string of bases, an associated sequence of per-base quality scores, and other metadata, and in aggregate can require a very large amount of space. The quality scores indicate how accurate the bases are with respect to the sequencing process, that is, how confident the sequencer is of having called them correctly, and they are the largest component of data sets in which they are retained. Previous research has examined how to store sequences of bases effectively; here we add to that knowledge by examining methods for compressing quality scores. The quality values originate in a continuous domain, so once a fidelity criterion is introduced there is flexibility in how these values are represented, which allows lossy compression of the quality score data.

Results: We present existing compression options for quality score data, and then introduce two new lossy techniques. Experiments measuring the trade-off between compression ratio and information loss are reported, including quantifying the effect of lossy representations on a downstream application that carries out single nucleotide polymorphism and insertion/deletion detection. The new methods are demonstrably superior to other techniques when assessed against the spectrum of possible trade-offs between storage required and fidelity of representation.
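Illustration: the trade-off between storage and fidelity described in the abstract can be made concrete with a minimal sketch. It assumes integer Phred-scaled quality scores (Q = -10 log10 p) and uses plain uniform binning, which is a generic baseline and not one of the two techniques introduced in the paper. The function names, the sample scores, and the choice of empirical entropy as a stand-in for compressed size are all assumptions made for illustration.

```python
import math
from collections import Counter


def phred_to_error_prob(q):
    """Phred scale: Q = -10 * log10(p), hence p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def uniform_bin(scores, width):
    """Replace each score by a representative of its bin of `width` values.

    The representative is chosen near the middle of the bin, so the
    absolute error per score is at most width // 2.
    """
    return [(q // width) * width + (width - 1) // 2 for q in scores]


def entropy_bits_per_score(scores):
    """Zero-order empirical entropy, a rough proxy for compressed size."""
    counts = Counter(scores)
    n = len(scores)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mean_abs_error(original, lossy):
    """Average absolute distortion introduced by the lossy mapping."""
    return sum(abs(a - b) for a, b in zip(original, lossy)) / len(original)


# Made-up quality scores for a single read (illustrative only).
scores = [40, 40, 38, 37, 35, 30, 22, 12, 2, 2]

for width in (1, 4, 8):
    lossy = uniform_bin(scores, width)
    print(width, entropy_bits_per_score(lossy), mean_abs_error(scores, lossy))
```

Wider bins can only lower (never raise) the entropy estimate while increasing the distortion. A fidelity criterion such as a maximum per-score error fixes a point on that curve.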


Full text

http://dx.doi.org/10.1093/bioinformatics/btu183.