## Sequence complexity for biological sequence analysis

L. Allison, L. Stern, T. Edgoose, and T.I. Dix

Computers and Chemistry 24, 43-55, 2000

Abstract

A new statistical model for DNA considers a sequence to be a mixture
of regions with little structure and regions that are approximate
repeats of other subsequences, i.e. instances of repeats do not need
to match each other exactly. Both forward- and reverse-complementary
repeats are allowed. The model has a small number of parameters which
are fitted to the data. In general there are many explanations for a
given sequence, and it is shown how to compute the total probability of
the data given the model. Computer algorithms are described for these
tasks. The model can be used to compute the information content of a
sequence, either in total or base by base. This amounts to looking at
sequences from a data-compression point of view and it is argued that
this is a good way to tackle intelligent sequence analysis in general.
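
To make the idea of base-by-base information content concrete, the following is a minimal sketch. It is not the repeat model described in the paper. It assumes a much simpler stand-in, an adaptive order-0 model over the alphabet A, C, G, T with add-one counts. The function names `per_base_information` and `total_information` are hypothetical. Each base costs -log2 of the probability the model assigned to it before seeing it, and the total is the sum of those costs.

```python
import math

ALPHABET = "ACGT"  # assumption: plain DNA alphabet, no ambiguity codes


def per_base_information(seq):
    """Return the cost in bits of each base under an adaptive order-0 model.

    Counts start at 1 for every base (an assumed prior, chosen for
    simplicity) and are updated only after each base has been costed.
    """
    counts = {base: 1 for base in ALPHABET}
    total = len(ALPHABET)
    bits = []
    for base in seq:
        probability = counts[base] / total
        bits.append(-math.log2(probability))
        counts[base] += 1
        total += 1
    return bits


def total_information(seq):
    """Return the total information content of seq in bits."""
    return sum(per_base_information(seq))


if __name__ == "__main__":
    for example in ("AAAAAAAA", "ACGTACGT"):
        print(example, round(total_information(example), 2), "bits")
```

Under this stand-in model a run of identical bases costs fewer bits than a sequence that uses all four bases evenly, which is the sense in which a lower information content indicates more structure. A model that also recognises approximate repeats, as the paper proposes, would additionally compress a sequence such as ACGTACGT.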