Approximate word matches between two random sequences (Q2476396)

From MaRDI portal





scientific article; zbMATH DE number 5251129
Language Label Description Also known as
default for all languages
No label defined
    English
    Approximate word matches between two random sequences
    scientific article; zbMATH DE number 5251129

      Statements

      Approximate word matches between two random sequences (English)
      0 references
      0 references
      0 references
      19 March 2008
      0 references
      Let \({\mathcal L}= \{A,G,C,T\}\) be an alphabet with strand-symmetric probability measure \(\xi= \{\xi_A, \xi_G, \xi_C, \xi_T\}\) and perturbation parameter \(\eta\). That is, \(-1\leq\eta\leq 1\) is the unique number satisfying \(\xi_A= \xi_T= (1/4)(1+ \eta)\), \(\xi_C= \xi_G= (1/4)(1- \eta)\). Let \({\mathbf A}= A_1 A_2\cdots A_n\) and \({\mathbf B}= B_1 B_2\cdots B_n\) be two random squences of length \(n\) over \({\mathcal L}\) such that the letters are i.i.d. Let \({\mathbf x}\) and \({\mathbf y}\) be two words of length \(m< n\) and define the distance between \({\mathbf x}\) and \({\mathbf y}\) by \[ \delta({\mathbf x},{\mathbf y})=\text{number of characters mismatches between \({\mathbf x}\) and \({\mathbf y}\)}. \] We say that \({\mathbf x}\) is a \(k\)-neighbor of \({\mathbf y}\) if \(\delta({\mathbf x},{\mathbf y})\leq k< m\). Furthermore, define the statistic \(D^{(k)}_2\) to be the number of \(k\)-neighborhood \(m\)-word matches between the sequences \({\mathbf A}\) and \({\mathbf B}\), including overlaps. In this paper, the authors consider the mean of \(D^{(k)}_2\) and a lower and upper bound for the variance of \(D^{(k)}_2\) and prove among others the following theorem: If \(\eta\neq 0\), \(m= \alpha\log_{1/p_2}(n)+ C\) with \(0\leq\alpha<{1\over 2}\) and \(C\) a constant and \(0\leq k< m\) is fixed, then \[ {D^{(k)}_2- E(D^{(k)}_2)\over \sqrt{\text{Var}(D^{(k)}_2)}}\overset {d}\Rightarrow{\mathcal N}(0,1)\quad\text{as }{n}\infty. \] Here, \(p_2= \sum_{a\in{\mathcal L}}\xi^2_a\). The results are useful in the study of bioinformatics for expressed sequence tag database searches.
      0 references
      DNA sequences
      0 references
      number of \(m\)-letter word matches
      0 references
      central limit theorem
      0 references
      sequence comparison
      0 references
      word matches
      0 references

      Identifiers