Approximate word matches between two random sequences

DOI: 10.1214/07-AAP452
MaRDI QID: Q2476396

Authors Conrad J. Burden, Miriam Ruth Kantorovitz, Susan R. Wilson

Publication date 19 March 2008

Published in The Annals of Applied Probability

Full work available at URL https://arxiv.org/abs/0801.3145

zbMATH Keywords

DNA sequences; central limit theorem; sequence comparison; word matches; number of \(m\)-letter word matches

Mathematics Subject Classification ID

Protein sequences, DNA sequences (92D20); Functional limit theorems; invariance principles (60F17)

Abstract: Given two sequences over a finite alphabet \(\mathcal{L}\), the \(D_{2}\) statistic is the number of \(m\)-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the \(D_{2}\) statistic in the context of DNA sequences, under the assumption of strand symmetric Bernoulli text. For \(k < m\), we look at the count of \(m\)-letter word matches with up to \(k\) mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance and prove its distribution is asymptotically normal.
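The statistic described in the abstract can be illustrated with a minimal brute-force sketch. The function names `approx_word_matches` and `expected_matches` are hypothetical and are not taken from the paper or from d2_cluster. The expectation shown is the elementary one obtained by linearity of expectation, assuming independent sequences with i.i.d. letters, so that the mismatch count of any word pair is binomial. It is offered as an assumption-laden illustration rather than as the paper's derivation.

```python
from math import comb


def approx_word_matches(a: str, b: str, m: int, k: int = 0) -> int:
    """Count pairs (i, j) whose m-letter words differ in at most k positions.

    With k = 0 this is the plain count of exact m-letter word matches.
    """
    if m <= 0 or not 0 <= k < m:
        raise ValueError("require m > 0 and 0 <= k < m")
    count = 0
    for i in range(len(a) - m + 1):
        for j in range(len(b) - m + 1):
            # Hamming distance between the two m-letter words
            mismatches = sum(x != y for x, y in zip(a[i:i + m], b[j:j + m]))
            if mismatches <= k:
                count += 1
    return count


def expected_matches(n1: int, n2: int, m: int, k: int, probs: dict) -> float:
    """Expected count for independent i.i.d. sequences of lengths n1 and n2.

    Assumption: letters are drawn independently with probabilities `probs`,
    so two letters agree with probability p2 = sum of squared probabilities
    and the number of mismatches in a word pair is Binomial(m, 1 - p2).
    """
    p2 = sum(p * p for p in probs.values())
    p_pair = sum(comb(m, t) * (1 - p2) ** t * p2 ** (m - t) for t in range(k + 1))
    return max(n1 - m + 1, 0) * max(n2 - m + 1, 0) * p_pair


# Strand symmetric letter distribution: P(A) = P(T) and P(C) = P(G).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(approx_word_matches("ACGT", "ACGA", m=2, k=0))  # 2
print(approx_word_matches("ACGT", "ACGA", m=2, k=1))  # 3
print(expected_matches(4, 4, m=2, k=1, probs=uniform))
```

The double loop costs on the order of the product of the two sequence lengths times \(m\), which is fine for illustration but not for database-scale searches.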

Cited in (9)

Uses Software: d2_cluster

