Inferring a DNA sequence from erroneous copies (Q1390940)

In the laboratory technique of \textit{single-molecule DNA sequencing}, index \textit{single-molecule DNA sequencing} single-stranded DNA is cut, a single base at a time. This base then flows down a microscopic tube at high speed past an optical sensor, which detects the type of base. This technique is error-prone, due to sputtering, especially at the beginning and end of the DNA molecule being cut. By repeating the process many times, we produce a collection of \(N\) erroneous copies of an original DNA sequence \(s_1,\dots,s_n\), which is to be determined. In this article, the authors design and analyze a polynomial time algorithm to determine the original DNA sequence from poly-logarithmically many erroneous copies. Here is the algorithm: Given \(N\) erroneous copies of an original DNA sequence to be determined, let \(M=N\). Partition the \(M\) sequences into \(M/3\) many groups of three sequences. Apply dynamic programming to determine an optimal alignment of the three sequences \(U,V,W\) in each group. From this alignment, define the \textit{concensus sequence} \(S\) obtained by taking the majority symbol in each column, and note that \(S\) is a \textit{Steiner sequence} for \(U,V,W\), meaning that \(S \in \{ A,C,G,T \}^*\) satisfies \[ D(S,U)+D(S,V)+D(S,W)\;\text{equals} \min \left\{D(T,U)+D(T,V)+D(T,W) : T \in \{ A,C,G,T\}^*\right\}. \] The previous step yields \(M/3\) many Steiner sequences. If \(M=3\) then stop and output the resulting sequence. Else let \(M = M/3\), and return to step (2). The algorithm's runtime is \( O(N n^3)\). How well does this heuristics perform in determining the original sequence \(s_1,\ldots,s_n\) from erroneous copies produced by single-molecule DNA sequencing, and how many erroneous copies are required to determine \(s_1,\ldots,s_n\) with high probability? The authors answer these questions, subject to the hypotheses that the original DNA sequence \(s_1,\ldots,s_n\) is a random sequence (technically \textit{Kolmogorov random}, or \textit{incompressible}) and insertion, deletion and substitution are equally likely. The former assumption is biologically unrealistic, especially for DNA sequences of interest (genes, promoter sequences, etc.), while the latter assumption seems reasonable, given the physical device used for single-molecule DNA sequencing. Under these assumptions, it is shown that step (3) of the above algorithm reduces an original error rate of \(\epsilon\) to almost \(\epsilon^2\). It suffices to reduce the error rate to less than \(1/n\), since the expected number of errors in a sequence of length \(n\) will be less than \(n \cdot 1/n = 1\), and the original sequence will have been determined. To this end, let \(c\) be such that \(\epsilon = 1/c\), and note that the least \(k\) such that \(\epsilon^{2^k}=1/c^{2^k} < 1/n\) is \(\lceil \log_2 \log_2 n - \log_2 \log_2 c \rceil = O( \log_2 \log_2 n)\). Assuming that the resulting \(M/3\) Steiner sequences in each pass of the algorithm are random (this has no analytic justification), the authors then argue that it suffices to take \(N = 3^{k} = 3^{\log_2 \log_2 n} = (\log_2 n)^{\log_2 3} \approx \log_2^{1.5849} n\), which is a small number of erroneous copies.

0 references

reviewed by

Peter Clote

0 references

zbMATH Keywords

multiple sequence alignments

0 references

DNA sequencing

0 references

MaRDI profile type

MaRDI publication profile

0 references