The distribution of subword counts is usually normal (Q685313): Difference between revisions

Let \({\mathcal A}\) be a finite set, which we will call our alphabet. A generalized word of length \(\ell\) is a non-empty subset of \({\mathcal A}^ \ell\); i.e. a collection of words all of the same length. The number of occurrences of the generalized word \(W\) in the string \(S\in {\mathcal A}^ n\) is the number of \(i\in [0,n-\ell]\) such that for some \(w\in W\) we have \(w_ j=S_{i+j}\) for \(1\leq j\leq\ell\). We endow \({\mathcal A}^ n\) with a probability measure by assuming that its elements are generated by a Bernoulli process. Let \({\mathcal W}=(W_ 0,\dots, W_ d)\) be a list of generalized words and let \(X_{n,i}\) be a random variable that is equal to the number of occurrences of \(W_ i\) in a random element of \({\mathcal A}^ n\). Our goal is to show that under fairly general conditions on \({\mathcal W}\) the random variables \(X_{n,i}\) \((c<i\leq d)\) satisfy a local limit theorem as \(n\to\infty\): the \((X_{n,i}-nm_ i)/\sqrt{n}\) converge to a mean zero joint normal for appropriate \(m_ i\), where we condition on \(X_{n,0}=0\) and on \(X_{n,i}\) for \(1\leq i\leq c\). Essentially, the non-zero conditioned counts may go to infinity in any manner as long as they stay a rasonable distance within the limits imposed on the requirement that counts must be possible for some words. Although formulas can be given for the \(m_ i\) and the covariance matrix of the limiting distribution, they are usually too messy to compute. An important exception occurs when there is no conditioning. In that case, we give rather simple formulas. The next section contains the statement of the results discussed in the last paragraph, together with the terminology needed to state the results. In Section 3 we discuss how our results can be used to obtain Mood's results [\textit{A. M. Mood}, The distribution theory of runs, Ann. Math. Stat. 11, 367-392 (1940; Zbl 0024.05301)] for runs count statistics. In Sections 4 and 5, we state and prove some more general results of Markov chains that are used in Section 6 to prove our results for words.

0 references

Mathematics Subject Classification ID

68R15

0 references

0 references

0 references

runs count statistics

0 references

Markov chains

0 references

MaRDI profile type

MaRDI publication profile

0 references

full work available at URL

https://doi.org/10.1006/eujc.1993.1030

0 references

OpenAlex ID

W2034145759

0 references

Sitelinks

Mathematics(1 entry)

mardi Publication:685313

@@ Property / MaRDI profile type @@
+MaRDI publication profile
@@ Property / MaRDI profile type: MaRDI publication profile / rank @@
+Normal rank
@@ Property / full work available at URL @@
+https://doi.org/10.1006/eujc.1993.1030
+Normal rank
@@ Property / OpenAlex ID @@
+W2034145759
@@ Property / OpenAlex ID: W2034145759 / rank @@
+Normal rank
@@ links / mardi / name / links / mardi / name @@
+Publication:685313