Approximate word matches between two random sequences (Q2476396)

Let \({\mathcal L}= \{A,G,C,T\}\) be an alphabet with strand-symmetric probability measure \(\xi= \{\xi_A, \xi_G, \xi_C, \xi_T\}\) and perturbation parameter \(\eta\). That is, \(-1\leq\eta\leq 1\) is the unique number satisfying \(\xi_A= \xi_T= (1/4)(1+ \eta)\), \(\xi_C= \xi_G= (1/4)(1- \eta)\). Let \({\mathbf A}= A_1 A_2\cdots A_n\) and \({\mathbf B}= B_1 B_2\cdots B_n\) be two random squences of length \(n\) over \({\mathcal L}\) such that the letters are i.i.d. Let \({\mathbf x}\) and \({\mathbf y}\) be two words of length \(m< n\) and define the distance between \({\mathbf x}\) and \({\mathbf y}\) by \[ \delta({\mathbf x},{\mathbf y})=\text{number of characters mismatches between \({\mathbf x}\) and \({\mathbf y}\)}. \] We say that \({\mathbf x}\) is a \(k\)-neighbor of \({\mathbf y}\) if \(\delta({\mathbf x},{\mathbf y})\leq k< m\). Furthermore, define the statistic \(D^{(k)}_2\) to be the number of \(k\)-neighborhood \(m\)-word matches between the sequences \({\mathbf A}\) and \({\mathbf B}\), including overlaps. In this paper, the authors consider the mean of \(D^{(k)}_2\) and a lower and upper bound for the variance of \(D^{(k)}_2\) and prove among others the following theorem: If \(\eta\neq 0\), \(m= \alpha\log_{1/p_2}(n)+ C\) with \(0\leq\alpha<{1\over 2}\) and \(C\) a constant and \(0\leq k< m\) is fixed, then \[ {D^{(k)}_2- E(D^{(k)}_2)\over \sqrt{\text{Var}(D^{(k)}_2)}}\overset {d}\Rightarrow{\mathcal N}(0,1)\quad\text{as }{n}\infty. \] Here, \(p_2= \sum_{a\in{\mathcal L}}\xi^2_a\). The results are useful in the study of bioinformatics for expressed sequence tag database searches.

0 references

zbMATH Keywords

DNA sequences

0 references

number of \(m\)-letter word matches

0 references

central limit theorem

0 references

sequence comparison

0 references

word matches

0 references