On the amount of noise inherent in bandwidth selection for a kernel density estimator (Q1117628): Difference between revisions

In the kernel density estimator \[ \hat f_h(x)=n^{-1}\sum_{i=1}^{n}h^{-1}K((x-X_i)/h), \] where \(\{X_1,\dots,X_n\}\) is a random sample from a distribution with density function \(f\), the choice of the bandwidth sequence \(h\equiv h_n\) is crucial to the performance of the estimator. The choice of \(h\) which minimizes the mean integrated square error, \[ \mathrm{MISE}(\hat f_h,f)=E\int_{-\infty}^{\infty}(\hat f_h(x)-f(x))^2\,dx=\int_{-\infty}^{\infty}E(\hat f_h(x)-f(x))^2\,dx, \] is asymptotically \(h_f=n^{-1/5}\alpha(K)\beta(f)\), where \(\alpha(K)\) and \(\beta(f)\) are constants which depend on the kernel \(K\) and the density \(f\), respectively. Since the choice of \(K\) is available to the user, \(\alpha(K)\) is known. However, \[ \beta(f)=\Bigl[\int_{-\infty}^{\infty}|f^{(2)}(y)|^2\,dy\Bigr]^{-1/5} \] varies greatly even over well-known statistical distributions, \(f\) is unknown, and \(\hat f_h\) is not robust against poor choices of \(h\). Thus, any practical method of choosing a bandwidth must depend only on the sample. Let \(\hat h_f\) be the choice of \(h\) which minimizes the integrated square error, \[ \Delta(h,f)=\int(\hat f_h(x)-f(x))^2\,dx. \] Let \(\hat h_c\) be the least squares cross-validation choice of \(h\), which minimizes \[ CV(h)=\int\hat f_h(x)^2\,dx-2n^{-1}\sum_{i=1}^{n}\hat f_{h,i}(X_i), \] where \(\hat f_{h,i}\) denotes the kernel density estimator with the \(i\)th observation deleted from the sample. Clearly, \(\hat h_f\) and \(\hat h_c\) are functions of the sample, and \(\Delta(\hat h_f,f)\leq\Delta(\hat h_c,f)\). However, when \(K\) is a smooth symmetric density and \(f\) is twice differentiable, then \[ \hat h_c/\hat h_f-1={\mathcal O}_p(n^{-1/10})\quad\text{and}\quad\Delta(\hat h_c,f)/\Delta(\hat h_f,f)-1={\mathcal O}_p(n^{-1/5}), \] where \({\mathcal O}_p\) denotes order in probability, that is, the left-hand side divided by the stated rate is bounded in probability.
The major results of this paper show that these upper bounds are the best possible, in the sense that for any measurable function \(\hat h\) of \(X_1,\dots,X_n\), \[ \lim_{\epsilon\to 0}\liminf_{n\to\infty}\sup_{f\in F}P_f\bigl[|\hat h/\hat h_f-1|>\epsilon n^{-1/10}\bigr]=1 \] and \[ \lim_{\epsilon\to 0}\liminf_{n\to\infty}\sup_{f\in F}P_f\bigl[|\Delta(\hat h,f)/\Delta(\hat h_f,f)-1|>\epsilon n^{-1/5}\bigr]=1, \] where \(F\) is the class of all densities whose second derivatives exist and are uniformly bounded by a constant \(B>0\).
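
The two bandwidths compared in the review can be illustrated numerically. The following is a minimal sketch, not the paper's procedure: it assumes a Gaussian kernel \(K\) and a standard normal \(f\) (both chosen only because \(\int\hat f_h^2\) and \(\Delta(h,f)\) then have closed forms), uses the standard least squares cross-validation criterion \(\int\hat f_h^2-2n^{-1}\sum_i\hat f_{h,i}(X_i)\), and minimizes over a finite grid rather than over all \(h>0\). All function names are hypothetical.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def gauss(u, s):
    """Density of N(0, s^2) evaluated at u."""
    return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def int_fhat_sq(h, x):
    """Integral of fhat_h^2 for a Gaussian kernel (assumed kernel).

    The convolution of two N(0, h^2) kernels is N(0, 2 h^2), so the
    integral reduces to a double sum over pairwise differences.
    """
    d = x[:, None] - x[None, :]
    return gauss(d, SQRT2 * h).sum() / x.size ** 2

def cv_score(h, x):
    """Least squares cross-validation criterion CV(h)."""
    n = x.size
    k = gauss(x[:, None] - x[None, :], h)
    # Leave-one-out estimate at X_i: drop the diagonal term j = i.
    loo = (k.sum(axis=1) - np.diag(k)) / (n - 1)
    return int_fhat_sq(h, x) - 2.0 * loo.mean()

def ise_std_normal(h, x):
    """Integrated square error Delta(h, f) when f is standard normal (assumed)."""
    # int phi_h(x - X_i) phi_1(x) dx = phi_{sqrt(h^2 + 1)}(X_i)
    cross = gauss(x, np.sqrt(h * h + 1.0)).mean()
    # int phi_1(x)^2 dx = 1 / (2 sqrt(pi))
    return int_fhat_sq(h, x) - 2.0 * cross + 1.0 / (2.0 * np.sqrt(np.pi))

def argmin_on_grid(score, x, grid):
    """Grid point minimizing score(h, x); a stand-in for exact minimization."""
    return grid[np.argmin([score(h, x) for h in grid])]

rng = np.random.default_rng(0)
x = rng.standard_normal(200)            # simulated sample; f is known here
grid = np.linspace(0.05, 1.5, 146)

h_c = argmin_on_grid(cv_score, x, grid)        # data-driven choice
h_f = argmin_on_grid(ise_std_normal, x, grid)  # oracle choice, needs f
print("h_c / h_f - 1 =", h_c / h_f - 1.0)
print("Delta(h_c) / Delta(h_f) - 1 =",
      ise_std_normal(h_c, x) / ise_std_normal(h_f, x) - 1.0)
```

Repeating the last block over many simulated samples and several values of \(n\) gives a rough picture of how the two relative errors shrink. A single run is only one noisy draw, and a grid search limits the resolution of both minimizers.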

zbMATH Keywords

data-driven estimate

noise

smoothing parameter selection

window width

kernel density estimator

bandwidth

mean integrated square error

least squares

cross-validation

upper bounds

Identifiers

zbMATH Open document ID

0667.62022

DOI

10.1214/aos/1176350259

Mathematics Subject Classification ID

Sitelinks

Mathematics (1 entry)

mardi Publication:1117628

Revision as of 15:39, 13 July 2023, by Importer (Bots; 7,080,617 edits): Created a new Item
Revision as of 02:38, 31 January 2024, by Import240129110113 (Bots; 7,163,963 edits): Added link to MaRDI item.
links / mardi / name: (empty) → Publication:1117628