Doctoral Dissertation Defense: Moumita Karmakar
Advisor: Dr. Adragni
Researchers now routinely collect large amounts of data in which the number of predictors p is often too large to allow the thorough graphical visualization needed for sound regression modeling. Regression data are commonly collected jointly on (Y, X), where X = (X1, ..., Xp) is a random p-vector and Y is a univariate response.
In high-dimensional settings, problems frequently encountered in variable selection or estimation for regression analyses are (i) nonlinear relationships between the predictors and the response, (ii) a number of predictors much larger than the sample size, and (iii) the presence of sparsity and collinearity. In such situations, it can be useful to reduce the dimensionality of the predictor space while retaining the information needed to explain the response under consideration.
Principal fitted component (PFC; Cook, 2007) models are a class of likelihood-based inverse regression methods that yield a so-called sufficient reduction of the random p-vector of predictors X given the response Y. Three methodologies based on PFC models are presented: (1) consistency of a P-value guided variable selection method (PFC-pv; Adragni & Xi, 2015), (2) variable selection based on PFC (PFC-lrt) in the presence of nonlinearity and sparsity, and (3) Bayesian estimation of the dimension reduction subspace under the PFC model.
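For orientation, the PFC model is commonly written in the general form below. This is a sketch of the standard formulation from the PFC literature, not notation taken from the dissertation itself: the symbols \mu, \Gamma, \beta, f_y, \Delta, d, and r are assumptions introduced here.

```latex
\[
  X_y \;=\; \mu + \Gamma \beta f_y + \Delta^{1/2} \varepsilon,
  \qquad \varepsilon \sim N_p(0, I_p),
\]
% X_y      : the predictor vector X given Y = y
% \Gamma   : p x d semi-orthogonal basis matrix, \Gamma^{T}\Gamma = I_d, with d < p
% \beta    : d x r coefficient matrix
% f_y      : known r-vector of basis functions of y (e.g. polynomial terms), centered
% \Delta   : p x p positive definite error covariance
\[
  R(X) \;=\; \Gamma^{T} \Delta^{-1} X
  \qquad \text{with} \qquad Y \perp\!\!\!\perp X \mid R(X).
\]
```

Under this form, the d-dimensional reduction R(X) carries all the information that X holds about Y. In the isotropic case \Delta = \sigma^2 I_p, the reduction simplifies to \Gamma^{T} X.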
PFC-pv is a P-value guided hard-thresholding approach for variable selection based on PFC. Encouraging simulation results suggest that the procedure may be selection consistent. Our proof of consistency is based primarily on a fixed sequence of significance levels (α), in contrast to the data-driven choice of α in the originally proposed PFC-pv method. In addition, we explore how sample size, number of predictors, and significance level jointly affect variable selection.
When a nonlinear relationship is suspected and a possibly large number of predictors are irrelevant, the accuracy of the sufficient reduction is hindered. The proposed PFC-lrt method is a novel approach for variable selection in high dimensions when the relationship between the active predictors and the response is nonlinear. PFC-lrt adapts a sequential likelihood ratio test to the PFC model to obtain a "pruned" sufficient reduction. The resulting reduction has improved accuracy, allows accurate identification of the important predictors, and provides a sparse estimate of the reduction matrix.
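The P-value guided hard-thresholding idea behind PFC-pv can be illustrated with a minimal sketch. It is one plausible reading of the approach, not the exact procedure of Adragni & Xi (2015): each predictor is regressed on a basis f(y) of the response (an inverse regression), and a predictor is kept when its F-test P-value falls below a fixed level α. The function names, the cubic polynomial basis, and the default α are hypothetical choices made for illustration.

```python
import numpy as np
from scipy import stats


def marginal_pvalues(X, y, degree=3):
    """F-test P-value per predictor for the inverse regression X_j ~ f(y).

    Assumption: f(y) is a centered polynomial basis (y, y^2, ..., y^degree).
    """
    n, _ = X.shape
    basis = np.column_stack([y ** k for k in range(1, degree + 1)])
    basis = basis - basis.mean(axis=0)   # centering absorbs the intercept
    Xc = X - X.mean(axis=0)
    r = basis.shape[1]

    # Least-squares fit of all predictors on the basis at once.
    coef, *_ = np.linalg.lstsq(basis, Xc, rcond=None)
    resid = Xc - basis @ coef

    rss = (resid ** 2).sum(axis=0)       # residual sum of squares per predictor
    tss = (Xc ** 2).sum(axis=0)          # total sum of squares per predictor
    df_resid = n - r - 1
    f_stat = ((tss - rss) / r) / (rss / df_resid)

    # Upper-tail probability of the F(r, n - r - 1) distribution.
    return stats.f.sf(f_stat, r, df_resid)


def select_predictors(X, y, alpha=0.01, degree=3):
    """Hard-threshold: keep indices whose P-value is below alpha."""
    return np.flatnonzero(marginal_pvalues(X, y, degree) < alpha)


# Simulated example: only the first two of 50 predictors depend on y,
# one linearly and one quadratically.
rng = np.random.default_rng(0)
n, p = 200, 50
y = rng.standard_normal(n)
X = rng.standard_normal((n, p))
X[:, 0] += 2.0 * y
X[:, 1] += y ** 2

selected = select_predictors(X, y, alpha=1e-4)
```

A fixed α mirrors the fixed significance levels used in the consistency argument; a data-driven choice of α would replace the constant passed to `select_predictors`.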
In the third part, we develop a fully Bayesian estimation of the parameters in the PFC model using proper prior distributions on both the Stiefel and Grassmann manifolds for the reduction matrix. Efficient Gibbs samplers are developed, and the efficacy of the Bayes estimate is illustrated through simulations.