Topics of Variable Selection in Biomedical Data Mining


High-dimensional data are becoming increasingly available in many fields as data collection technology advances. We want to know which variables are relevant to the response and which are not; in addition, a simpler model with fewer predictor variables is easier to interpret and cheaper to compute, and it may yield more precise estimates. Variable selection methods are therefore becoming more critical. This dissertation studies variable selection methods motivated by two types of biomedical datasets: DNA methylation data and medical cost data.
First, we apply the Iterative Sure Independence Screening (ISIS) method with an elastic net penalty to study the association between metabolic syndrome and DNA methylation in the Normative Aging Study. To increase power, we create a metabolic syndrome index and construct a binomial model. We demonstrate that the screening step in ISIS can significantly improve the performance of the elastic net. Among 484,548 CpG markers, we identify four CpGs that map to two biologically relevant and functional genes.
DNA methylation markers may mediate pathways linking environmental exposures with health outcomes. We propose a joint significance test for mediation effects using sure independence screening and minimax concave penalty techniques. Using this method, we identify two CpGs with significant mediation effects in the pathway from smoking to reduced lung function in the Normative Aging Study.
To handle DNA methylation levels as multivariate outcomes, we propose a weighted square-root LASSO. It estimates the regression coefficient matrix in a sparse multivariate regression model while accounting for the correlations between high-dimensional responses, whose dimension is larger than the sample size. Our method is tuning-insensitive and has advantages in both variable selection and computational efficiency.
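
The screening idea behind the methylation analyses above can be illustrated with a minimal sketch. It is not the dissertation's implementation: it performs a single, non-iterative screening pass, uses a continuous toy response with Pearson correlation instead of a binomial model, and leaves out the penalized fit (elastic net or minimax concave penalty) that would follow on the retained predictors. The names `pearson` and `sis_screen`, the toy data, and the cutoff `d` are all assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant column carries no marginal signal
    return sxy / math.sqrt(sxx * syy)


def sis_screen(columns, y, d):
    """Indices of the d predictors with the largest absolute marginal correlation."""
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:d])


# Toy data (hypothetical): 200 predictors, 60 samples, so p > n.
# Only columns 3 and 17 actually drive the response.
rng = random.Random(0)
n, p = 60, 200
columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [2.0 * columns[3][i] - 2.0 * columns[17][i] + rng.gauss(0, 0.5)
     for i in range(n)]

# Keep the 10 highest-ranked predictors; a penalized regression would then
# be fitted on this much smaller set instead of on all 200 columns.
kept = sis_screen(columns, y, d=10)
print(kept)
```

Screening first shrinks the problem from p to d candidate predictors, which is why a subsequent penalized fit can be both faster and more stable than one run on the full marker set.
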
Finally, motivated by medical cost data from the Medical Expenditure Panel Survey, we propose Spike-or-Slab priors for Bayesian variable selection based on asymptotically normal estimates of the full model parameters. The key difficulty in implementing Bayesian methods is computation. Moreover, medical cost data have several distinctive statistical issues, such as heteroscedasticity and severe skewness. Our proposed method overcomes these issues, fitting the data robustly without transforming the response variable. In addition, ranking the predictor variables by Z-statistics reduces the scope of the model search and improves computational efficiency.
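
To see how ranking by Z-statistics can shrink the model search, consider the following sketch. The abstract states only that the ranking reduces the search scope; restricting the search to a nested sequence of models along that ranking is an assumption made here for illustration, as are the function names and the example estimates and standard errors.

```python
def rank_by_z(estimates, std_errors):
    """Order predictor indices by decreasing |Z|, where Z = estimate / standard error."""
    z = [b / s for b, s in zip(estimates, std_errors)]
    order = sorted(range(len(z)), key=lambda j: abs(z[j]), reverse=True)
    return order, z


def nested_candidates(order):
    """Nested candidate models: the empty model, then the top 1, top 2, ... predictors."""
    return [sorted(order[:k]) for k in range(len(order) + 1)]


# Hypothetical full-model estimates and standard errors for four predictors.
estimates = [0.10, -1.50, 0.02, 0.80]
std_errors = [0.20, 0.30, 0.10, 0.25]

order, z = rank_by_z(estimates, std_errors)
candidates = nested_candidates(order)

# With p predictors there are 2**p possible subsets but only p + 1 nested candidates.
print(order)
print(len(candidates), "candidate models instead of", 2 ** len(estimates))
```

Under this assumption the number of models to evaluate grows linearly in the number of predictors instead of exponentially, which is one way the computational burden of Bayesian variable selection could be eased.
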

Last modified
  • 01/29/2019