Selection of variables in two-group discriminant analysis by error rate and Akaike's information criteria (Q1074986)

The author considers two criteria for selecting the ''best'' subset of variables for the linear discriminant function in the case of two \(p\)-variate normal populations \(\Pi_1\), \(\Pi_2\) with different means and a common covariance matrix; the means and the covariance matrix are unknown and are estimated from random samples of unequal sizes \(N_1\), \(N_2\). One criterion is based on minimizing \textit{G. J. McLachlan's} asymptotically unbiased estimate [Biometrics 36, 501-510 (1980; Zbl 0442.62046)] of the misclassification error rate, \[ M(j)=\Phi\left[-2^{-1}D_j+2^{-1}(k_j-1)(N_1^{-1}+N_2^{-1})/D_j+\{32(N_1+N_2-2)\}^{-1}D_j\{4(4k_j-1)-D_j^2\}\right], \] where \(\Phi\) is the standard normal distribution function, \(D_j\) is the sample Mahalanobis distance between \(\Pi_1\) and \(\Pi_2\) based on the \(j\)-th subset of variables, and \(k_j\) is the dimension of this subset. The other selection criterion is based on a ''no additional information'' model and minimizes Akaike's information criterion \[ A(j)=(N_1+N_2)\log\{1+(p-k_j)F(j)/(N_1+N_2-p-1)\}+2(k_j-p), \] where \[ F(j)=\{(N_1+N_2-p-1)/(p-k_j)\}(D^2-D_j^2)/\{(N_1+N_2-2)(N_1^{-1}+N_2^{-1})+D_j^2\}, \] \(D\) being the \(p\)-variate sample Mahalanobis distance. It is shown that the expected error rate is closely related to the no additional information model. The asymptotic distributions and error rate risks of both criteria are obtained and shown to coincide, so in this sense the two criteria are asymptotically equivalent.
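
As an illustration only, the two criteria can be evaluated numerically once the sample Mahalanobis distances are available. The following minimal Python sketch transcribes the formulas for \(M(j)\), \(F(j)\) and \(A(j)\) as stated in the review; the function names, the argument names and the numerical values in the example are hypothetical and are not taken from the paper. For the full set (\(k_j=p\)) the statistic \(F(j)\) is undefined, while \(A(j)\) reduces to 0, so the sketch computes \(A(j)\) from the ratio \((p-k_j)F(j)/(N_1+N_2-p-1)\) directly.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def error_rate_criterion(d_j, k_j, n1, n2):
    """M(j): estimated error rate for a subset with distance d_j > 0 and dimension k_j."""
    z = (-0.5 * d_j
         + 0.5 * (k_j - 1) * (1.0 / n1 + 1.0 / n2) / d_j
         + d_j * (4.0 * (4 * k_j - 1) - d_j ** 2) / (32.0 * (n1 + n2 - 2)))
    return normal_cdf(z)


def _distance_ratio(d2_full, d2_j, n1, n2):
    """(D^2 - D_j^2) / {(N1+N2-2)(1/N1 + 1/N2) + D_j^2}, shared by F(j) and A(j)."""
    n = n1 + n2
    return (d2_full - d2_j) / ((n - 2) * (1.0 / n1 + 1.0 / n2) + d2_j)


def f_statistic(d2_full, d2_j, p, k_j, n1, n2):
    """F(j) for a proper subset (k_j < p); arguments are squared distances."""
    if k_j >= p:
        raise ValueError("F(j) is only defined for k_j < p")
    n = n1 + n2
    return (n - p - 1) / (p - k_j) * _distance_ratio(d2_full, d2_j, n1, n2)


def aic_criterion(d2_full, d2_j, p, k_j, n1, n2):
    """A(j) = N log{1 + (p - k_j) F(j) / (N - p - 1)} + 2 (k_j - p), with N = N1 + N2."""
    n = n1 + n2
    return n * math.log(1.0 + _distance_ratio(d2_full, d2_j, n1, n2)) + 2 * (k_j - p)


if __name__ == "__main__":
    # Hypothetical example: p = 3 variables, made-up squared distances per subset.
    n1, n2, p = 20, 30, 3
    d2_full = 4.0
    subsets = {("x1",): 1.5, ("x1", "x2"): 3.8, ("x1", "x2", "x3"): 4.0}
    for names, d2_j in subsets.items():
        k_j = len(names)
        m = error_rate_criterion(math.sqrt(d2_j), k_j, n1, n2)
        a = aic_criterion(d2_full, d2_j, p, k_j, n1, n2)
        print(names, "M(j) =", round(m, 4), "A(j) =", round(a, 4))
```

Under either criterion, the subset with the smallest value would be selected; in practice the squared distances would first be computed from the sample means and the pooled sample covariance matrix.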

zbMATH Keywords

two-group discriminant analysis

selection of variables

linear discriminant function

p-variate normal populations

different means
