Statistics Colloquium: Dr. Hoyoung Park
NIH
Abstract: The classification of high-dimensional data is an important problem that has been studied for a long time. Many studies have proposed linear classifiers based on Fisher’s linear discriminant analysis (LDA) rule, which requires estimating the unknown covariance matrix and the mean vector of each group. In particular, when the data dimension p exceeds the number of observations n (p > n), the sample covariance matrix cannot be a good estimator of the covariance matrix due to its well-known rank deficiency. To address this problem, many studies have proposed modifications of the LDA classifier based on diagonalization or regularization of the covariance matrix. In this paper, we categorize existing methods into three cases and discuss the shortcomings of each. To compensate for these shortcomings, our key idea is to estimate the high-dimensional mean vector and covariance matrix jointly, whereas existing methods focus on a shrinkage estimator of either the mean vector or the covariance matrix alone. We provide theoretical results showing that the proposed method succeeds under both sparse and dense mean vector structures, whereas some existing methods work well only in specific situations. We also present numerical studies showing that our method outperforms existing methods across various simulations and real data examples, including electroencephalography (EEG), gene expression microarray, and Spectro datasets.
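The rank deficiency mentioned in the abstract, and the regularization fix applied by some existing methods, can be illustrated with a short sketch (this is only a generic example on synthetic data, not the speaker's proposed method; the dimensions and ridge parameter are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50  # fewer observations than dimensions (p > n)
X = rng.standard_normal((n, p))

# The p x p sample covariance matrix has rank at most n - 1,
# so it is singular (non-invertible) whenever p > n.
S = np.cov(X, rowvar=False)
print(np.linalg.matrix_rank(S))  # at most n - 1 = 19, far below p = 50

# A ridge-style regularization (one common LDA modification)
# restores full rank and hence invertibility.
lam = 0.1  # illustrative shrinkage parameter
S_reg = S + lam * np.eye(p)
print(np.linalg.matrix_rank(S_reg))  # p = 50
```

Because the LDA rule inverts the covariance estimate, the singular sample covariance cannot be plugged in directly when p > n, which motivates the diagonalized and regularized variants discussed in the talk.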