Doctoral Dissertation Defense: Iris Ivy Gauran
Advisors: Drs. Junyong Park and DoHwan Park
Monday, April 16, 2018 · 10 AM - 12 PM
Title: Multiple Testing Procedures controlling False Discovery Rate with applications to genomic data
Abstract
In recent mutation studies, analyses based on protein domain positions are gaining popularity over traditional gene-centric approaches, since the latter are limited in their ability to consider the functional context that the position of a mutation provides. This presents a large-scale simultaneous inference problem, with hundreds of hypothesis tests to consider at the same time. The overarching objective of this thesis is to propose multiple testing procedures that address the problems posed by discrete genomic data. Specifically, we are interested in identifying significant mutation counts while controlling the Type I error at a given level via False Discovery Rate (FDR) procedures. One main assumption is that the mutation counts follow a zero-inflated model, which accounts for both the true zeros in the count model and the excess zeros. The class of models considered is the Zero-inflated Generalized Poisson (ZIGP) distribution.
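As a rough illustration of the model class named above, the following is a minimal sketch of a ZIGP probability mass function. The abstract does not state a parameterization, so this assumes Consul's form of the Generalized Poisson (rate `theta > 0`, dispersion `0 <= lam < 1`) mixed with a point mass at zero of weight `pi0`; the function and parameter names are hypothetical and are not taken from the thesis.

```python
import math


def gp_pmf(x, theta, lam):
    """Generalized Poisson pmf, assuming Consul's parameterization.

    P(X = x) = theta * (theta + lam * x)**(x - 1) * exp(-theta - lam * x) / x!
    with theta > 0 and 0 <= lam < 1. Setting lam = 0 gives the ordinary Poisson.
    """
    if x < 0:
        return 0.0
    # Work on the log scale to avoid overflow for large counts.
    log_p = (
        math.log(theta)
        + (x - 1) * math.log(theta + lam * x)
        - theta
        - lam * x
        - math.lgamma(x + 1)
    )
    return math.exp(log_p)


def zigp_pmf(x, pi0, theta, lam):
    """Zero-inflated Generalized Poisson pmf (assumed mixture form).

    With probability pi0 the count is an excess zero; otherwise it is drawn
    from the Generalized Poisson, which can itself produce a true zero.
    """
    base = (1.0 - pi0) * gp_pmf(x, theta, lam)
    return pi0 + base if x == 0 else base


if __name__ == "__main__":
    # Illustrative parameter values only; they are not estimates from any data.
    for count in range(5):
        print(count, zigp_pmf(count, pi0=0.4, theta=2.0, lam=0.3))
```

The zero probability combines both sources of zeros mentioned in the abstract: `pi0` for the excess zeros and `(1 - pi0) * exp(-theta)` for the true zeros of the count model.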
In the first study, we developed an Empirical Bayes procedure. We assumed that there exists a cut-off value such that counts smaller than this value are generated from the null distribution. We present several data-dependent methods to determine the cut-off value. We also consider a two-stage procedure based on a screening process, so that mutation counts exceeding a certain value are considered significant. Simulated and protein domain data sets are used to illustrate this procedure, in which the empirical null is estimated using a mixture of discrete distributions. Overall, while maintaining control of the FDR, the proposed cut-off method combined with the two-stage testing procedure has superior empirical power.
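The cut-off idea can be sketched with a generic local false discovery rate computation in the style of Efron's zero assumption. This is not the thesis procedure: the null pmf is passed in as a known function rather than estimated from a mixture of discrete distributions, the cut-off is fixed by hand rather than chosen by a data-dependent method, and all names (`local_fdr`, `null_pmf`, `cutoff`) are hypothetical.

```python
import math
from collections import Counter


def poisson_pmf(x, mu):
    """Ordinary Poisson pmf, used here only as a stand-in null distribution."""
    return math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))


def local_fdr(counts, null_pmf, cutoff):
    """Estimate the null proportion and local fdr for each observed count.

    Assumes every count strictly smaller than `cutoff` comes from the null,
    so the observed share of such counts is about pi0 times the null mass
    below the cut-off.
    """
    if cutoff < 1:
        raise ValueError("cutoff must be at least 1")
    n = len(counts)
    freq = Counter(counts)
    share_below = sum(v for k, v in freq.items() if k < cutoff) / n
    null_mass_below = sum(null_pmf(k) for k in range(cutoff))
    pi0 = min(1.0, share_below / null_mass_below)
    # Local fdr at count k: pi0 * f0(k) / f(k), with f(k) the empirical frequency.
    fdr = {k: min(1.0, pi0 * null_pmf(k) / (v / n)) for k, v in freq.items()}
    return pi0, fdr


if __name__ == "__main__":
    # Made-up counts for illustration; not data from the thesis.
    toy_counts = [0] * 60 + [1] * 25 + [2] * 10 + [15] * 5
    pi0_hat, fdr_hat = local_fdr(toy_counts, lambda k: poisson_pmf(k, 0.6), cutoff=3)
    print("estimated null proportion:", pi0_hat)
    for k in sorted(fdr_hat):
        print(k, fdr_hat[k])
```

Counts whose estimated local fdr falls below a chosen threshold would be flagged as significant; small counts, which the null explains well, receive values near one.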
In the second study, we developed full Bayesian procedures. We addressed the caveat of the Empirical Bayes procedure by proposing methods which can handle both the weakened assumption on the null distribution and the sparsity condition that is apparent among protein domains with a small number of positions. Based on the simulation studies, the full Bayesian methods are able to control the FDR when the Empirical Bayes method fails. We also studied several cases in order to assess whether we need to implement the zero assumption on the null distribution. Results revealed that implementing this key assumption still yields good results in terms of FDR control and high empirical power. In general, simulation results suggest that a smaller number of rejections is preferable. The number of identified hotspots in the real data analysis is consistent with the simulation studies.