Approximate word matches between two random sequences (Q2476396)

From MaRDI portal





scientific article
Language Label Description Also known as
default for all languages
No label defined
    English
    Approximate word matches between two random sequences
    scientific article

      Statements

      Approximate word matches between two random sequences (English)
      0 references
      0 references
      0 references
      19 March 2008
      0 references
      Let \({\mathcal L}= \{A,G,C,T\}\) be an alphabet with strand-symmetric probability measure \(\xi= \{\xi_A, \xi_G, \xi_C, \xi_T\}\) and perturbation parameter \(\eta\). That is, \(-1\leq\eta\leq 1\) is the unique number satisfying \(\xi_A= \xi_T= (1/4)(1+ \eta)\), \(\xi_C= \xi_G= (1/4)(1- \eta)\). Let \({\mathbf A}= A_1 A_2\cdots A_n\) and \({\mathbf B}= B_1 B_2\cdots B_n\) be two random squences of length \(n\) over \({\mathcal L}\) such that the letters are i.i.d. Let \({\mathbf x}\) and \({\mathbf y}\) be two words of length \(m< n\) and define the distance between \({\mathbf x}\) and \({\mathbf y}\) by \[ \delta({\mathbf x},{\mathbf y})=\text{number of characters mismatches between \({\mathbf x}\) and \({\mathbf y}\)}. \] We say that \({\mathbf x}\) is a \(k\)-neighbor of \({\mathbf y}\) if \(\delta({\mathbf x},{\mathbf y})\leq k< m\). Furthermore, define the statistic \(D^{(k)}_2\) to be the number of \(k\)-neighborhood \(m\)-word matches between the sequences \({\mathbf A}\) and \({\mathbf B}\), including overlaps. In this paper, the authors consider the mean of \(D^{(k)}_2\) and a lower and upper bound for the variance of \(D^{(k)}_2\) and prove among others the following theorem: If \(\eta\neq 0\), \(m= \alpha\log_{1/p_2}(n)+ C\) with \(0\leq\alpha<{1\over 2}\) and \(C\) a constant and \(0\leq k< m\) is fixed, then \[ {D^{(k)}_2- E(D^{(k)}_2)\over \sqrt{\text{Var}(D^{(k)}_2)}}\overset {d}\Rightarrow{\mathcal N}(0,1)\quad\text{as }{n}\infty. \] Here, \(p_2= \sum_{a\in{\mathcal L}}\xi^2_a\). The results are useful in the study of bioinformatics for expressed sequence tag database searches.
      0 references
      DNA sequences
      0 references
      number of \(m\)-letter word matches
      0 references
      central limit theorem
      0 references
      sequence comparison
      0 references
      word matches
      0 references

      Identifiers