This dissertation studies the small dispersion asymptotics in highly stratified models. My goal is to show that accurate inferences are possible even if s, the number of strata, is large while m, the number of observations within each stratum, is small, provided that the model ”fit well” in the term...
Many methods have been proposed for estimating the number, $m_0$ (or the proportion, $\pi_0$), of the true null hypotheses for adaptively controlling a type I error rate (e.g., the false discovery rate or FDR) using a multiple test procedure. Most of these methods eliminate ``significantly" non-null $p$-values. Then $m_0$ is...
Small area estimation (SAE) has been one of the most active areas in survey methodology research, due to the increasing demand for small area statistics from government agencies and the private sector. But in some areas of interest, sample sizes could be very small, or even zero, in which case,...
One of the most commonly used techniques for classification problem is logistic regression. For example, logistic regression for a binary response assumes that the odds Pr(y = 1|x)/Pr(y = 0) = exp(a+bx). However, in reality, the pattern of the data can be so complicated that logistic regression model often fails,...
The use of cluster randomized experiments to study the effects of treatments on groups of subjects has increased in recent years. Many of these experiments lack the necessary statistical power to detect practically meaningful effects of treatment. One method for improving power in cluster randomized experiments that has been advanced...
High-dimensional data are becoming increasingly available in various fields as data collection technology advances. Not only are we interested in knowing which variables are relevant to the response and which are not, but also a simpler model with less predictor variables is easier for interpretation and computational purposes. Furthermore, a...
In recent years, research has been conducted to develop Sequential, Multiple Assignment, Randomized Trial (SMART) designs. These experimental designs were created to aid in the construction of adaptive treatment strategies for individuals, particularly in medical contexts. Simultaneously, research has been done on developing the use of randomized trials to evaluate...
Last two decades have seen a surge of interests in approaches that leverage network structure in machine learning models. For many networks, not only the connections of the network but also the network attributes, such as node attributes and dyadic attributes, are observed. This heterogeneity in networks raises new challenges...
RNA-Sequencing (RNA-Seq) is a powerful high-throughput tool to profile transcriptional activities in cells. The observed read counts can be biased by various factors such that they do not accurately represent the true relative abundance of mRNA transcript abundance. Normalization is a critical step to ensure unbiased comparison of gene expression...
Modeling human language is at the very frontier of machine learning and artificial intelligence. Statistical language models are probabilistic models that assign probabilities to sequences of words. For example, topic models are frequently used text-mining tools to organize a vast set of unstructured documents by exploring their theme structure. More...