Comparative Analysis of Long DNA Sequences by Per Element Information Content Using Different Contexts

Trevor I Dix, David R Powell, Lloyd Allison, Julie Bernal, Samira Jaeger, and Linda Stern

BMC Bioinformatics 8(Suppl 2):S10, 2007.

Features of a DNA sequence can be found by compressing the sequence under a suitable model; good compression implies low information content. Good DNA compression models consider repetition, differences, and base distributions. From a linear DNA sequence, a compression model can produce a linear information sequence, giving the information content of each element. Linear space complexity is important when exploring long DNA sequences of the order of millions of bases. Compressing a sequence in isolation will include information on self-repetition, whereas compressing a sequence Y in the context of another sequence X can find what new information X gives about Y. This paper presents a methodology for performing comparative analysis to find features by such models.
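
To make the idea of a per-element information sequence concrete, the following is a minimal sketch, not the model used in the paper. It assumes a simple adaptive order-k Markov model with add-one smoothing as a stand-in for a real DNA compression model; the function name `information_sequence` and the parameters `context` and `k` are hypothetical. Compressing Y alone corresponds to an empty context, and compressing Y in the context of X corresponds to priming the model on X first.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four bases


def information_sequence(y, context="", k=2):
    """Per-base information content (in bits) of y.

    Illustrative stand-in model: an adaptive order-k Markov model with
    add-one smoothing, optionally primed on a context sequence X.
    """
    # Each k-base history maps to smoothed counts of the base that follows it.
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 1))

    def update(seq, i):
        counts[seq[max(0, i - k):i]][seq[i]] += 1

    # Learn from the context X; no cost is charged for X itself.
    for i in range(len(context)):
        update(context, i)

    info = []
    for i in range(len(y)):
        hist = counts[y[max(0, i - k):i]]
        p = hist[y[i]] / sum(hist.values())
        info.append(-math.log2(p))  # cost of this base in bits
        update(y, i)                # adapt only after the base is coded
    return info


y = "ACGT" * 5
alone = information_sequence(y)
given_x = information_sequence(y, context=y)
```

Under these assumptions, the output has one value per base of Y, so it needs space linear in the sequence length. Summing it gives the total compressed length, and comparing `alone` with `given_x` position by position indicates where the context X lowers the cost of coding Y.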

We apply such a model to find features across chromosomes of \emph{Cyanidioschyzon merolae}. We present a tool that provides useful linear transformations to investigate and save new sequences. Various examples illustrate the methodology, finding features for sequences alone and in different contexts.