In the Maximum-a-Posteriori (MAP) Inference problem, for any given probability distribution, the goal is to find the point in the support of that distribution with the highest probability. Potts models and Determinantal Point Processes (DPPs) are probabilistic models that were introduced in the context of statistical physics several decades ago....
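Stated as an equation (a restatement of the definition above, not text from the abstract): for a distribution p with support \mathcal{X}, MAP inference seeks

\[
x^{\star} = \arg\max_{x \in \mathcal{X}} \, p(x),
\]

i.e., the single most probable configuration, rather than marginal probabilities or samples from p.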
Modeling human language is at the very frontier of machine learning and artificial intelligence. Statistical language models are probabilistic models that assign probabilities to sequences of words. For example, topic models are frequently used text-mining tools for organizing vast sets of unstructured documents by exploring their thematic structure. More...
For stochastic simulation optimization in a modern computing era, we introduce a new parallel framework for solving very large-scale problems using a ranking & selection (R&S) approach that simulates all systems or feasible solutions to provide a global statistical guarantee. We propose a parallel adaptive survivor selection (PASS) framework that...
With the advancement of high-throughput sequencing technology, it has become much easier to extract gene expression data and to discover gene-disease associations more efficiently. Longitudinal gene expression data offer more insight into expression patterns for distinct patient groups compared to cross-sectional data. For instance, patients diagnosed with subclinical acute rejections...
In this dissertation, we aim to develop a theoretical understanding of foundation models and reinforcement learning. We delve into a comprehensive analysis of specific aspects within these domains. The focal points of our study are as follows: • Generative Adversarial Imitation Learning (GAIL) with Neural Networks: GAIL is poised to...
The ever-growing desire for accurate estimation and efficient learning necessitates efforts to quantitatively characterize model uncertainties. In this thesis, four problems pertaining to uncertainty quantification are discussed: A sequential stopping framework for constructing fixed-precision confidence regions is proposed for a class of multivariate simulation problems where variance...
With the rapid growth of demand for data center services, the energy and water use of data centers has become a critical concern in the contexts of climate change and freshwater conservation. Therefore, understanding, quantifying, and optimizing the use of energy and water resources in data centers has...
Cells are often precisely organized into patterns within developing tissues. This precision must emerge from biochemical processes within and between cells that are inherently stochastic. I investigated the impact of stochastic gene expression on self-organized pattern formation, focusing on Senseless (Sens), a key target of Wnt and Notch signaling during...
This dissertation focuses on subgroup identification in longitudinal studies. There are two different but related topics. In Chapters Two and Three, several longitudinal-data-based methods for subgroup identification with enhanced treatment effect are proposed to correct the deficiency of measuring treatment effect with a single summary statistic. In...
Sequential change-point detection for time series enables us to sequentially check the hypothesis that the model still holds as more and more data are observed. It is widely used in data monitoring in practice. In this work, we propose two models, a Binomial AR(1) model and a Generalized Beta AR(p) model, for modeling binomial...
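The excerpt is cut off before the model details, but a common Binomial AR(1) formulation (one plausible reading, not necessarily this work's exact specification) keeps the count X_t in {0, ..., n} via binomial thinning:

\[
X_t = \alpha \circ X_{t-1} + \beta \circ (n - X_{t-1}), \qquad \alpha \circ X \mid X \sim \mathrm{Binomial}(X, \alpha),
\]

so each new count is the sum of "survivors" of the previous count and new "arrivals" among the remaining n - X_{t-1} trials, which keeps the process bounded and well suited to sequential monitoring of count data.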
This dissertation contributes to the theory of segregation and methodologies to measure it. The first two chapters focus on the traditional problem of quantifying segregation in survey data through segregation indices. Segregation indices describe the segregation of an environment with a single number, usually between 0 and 1. The...
The advent of next-generation sequencing technologies has greatly promoted the development of metagenomics, and the analysis of compositional datasets has a wide range of applications in this area. Because of the constraint that species' relative abundances sum to 1, many traditional and classical statistical methods cannot...
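To make the sum-to-one constraint concrete: compositional analyses often move the data off the simplex with a log-ratio transform before applying standard methods. Aitchison's centered log-ratio (clr) is one standard choice, shown here for illustration only; the truncated excerpt does not say which transform, if any, this dissertation uses:

\[
\mathrm{clr}(x) = \left( \log \frac{x_1}{g(x)}, \ldots, \log \frac{x_D}{g(x)} \right), \qquad g(x) = \Big( \prod_{j=1}^{D} x_j \Big)^{1/D},
\]

where x is a composition of D relative abundances with \sum_j x_j = 1.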
In the short amount of time that genetic manipulation has been possible through CRISPR technology, myriad applications have been developed. Results from one of the most promising applications of this technology, pooled screens, have shown that single guide RNAs (sgRNAs), RNA sequences used to target specific regions of the genome,...
Randomization is considered the gold standard when it comes to evaluating the effectiveness of interventions, primarily due to its ability to avoid bias. However, in recent years, randomization has been heavily criticized in circumstances where subject randomization may not be ethical. In a randomized controlled trial, patients who are extremely...
A replication crisis has enveloped several scientific fields since the early 2000s (see Baker, 2016). This has given rise to improved research and reporting practices (e.g., F. S. Collins & Tabak, 2014), as well as a cottage industry of research into issues of replication and reproducibility (e.g., R. A. Klein...
Commonsense inference is a critical capability of modern artificial intelligence (AI) systems. Machines need commonsense knowledge to perform tasks the way human beings do. Learning commonsense inference from text has been a long-standing challenge in the field of natural language processing due to reporting bias: people do...
The advent of sequencing technologies has generated a large amount of biological and medical data. These data, such as genetic sequencing data and lab experimental evidence, can help us understand critical biomedical problems. This dissertation makes contributions to three different but related applications in biomedical research. In Chapter 2, we...
Modern design practices rely more and more on computer simulations due to their low cost compared with physical experiments. However, it is still an elusive task to fully unleash the advantages of the simulation models while mitigating their disadvantages for designing complex engineering systems. In simulation-based design, computer simulation models...
The heart of computational materials science lies in providing fundamental insights and understanding of materials behavior and properties across different scales. The significance of this task is highlighted by the Materials Genome Initiative and the emergence of computational tools and frameworks such as materials by design, microstructure sensitive design, and...
Seasonal malaria chemoprevention (SMC) was first recommended by the World Health Organization (WHO) in 2012 to prevent uncomplicated malaria in children and began implementation in Burkina Faso in 2014 under programmatic campaigns. Systematic assessment of the impact of national SMC campaigns requires data with weekly or monthly temporal resolution over...
This thesis develops novel methods for generating space-filling designs inside a design space and subsampling from a data set. It incorporates materials from two papers by the author: Shang and Apley 2021; Shang, Apley, and Mehrotra 2022a. Chapter 1 discusses space-filling designs of computer experiments, which is published as Shang and Apley...
Sequential batches of time-evolving data for a set of persistent identifiable entities (e.g. online shopping behavior by month for a customer ID, or economic figures by year for a collection of countries) can exhibit temporal shifts in their underlying clustering structure. Methods for recovering this evolutionary clustering structure exploit natural...
Deduplication, also referred to as "entity resolution", is a common and crucial pre-processing step in the construction of social networks. Traditional deduplication methods compare the attributes (such as name and age) of potential matching pairs to estimate a match probability for a pair. Recently, research has used clustering techniques for...
Innovations are adopted by individuals and spread to other individuals. They are adopted at different rates; some are never adopted at all, some are abandoned, and some become the new norms. Diffusion of innovations is an extensive evidence-based research and practice paradigm that studies how innovations spread. This...
Machine learning and deep learning have been proven successful across various scientific fields, such as computer vision, natural language processing, and recommendation systems. As models become more complex, with more parameters and intricate architectures, they can achieve higher prediction accuracy when trained on larger datasets. However, despite the great prediction...
The supervised learning model is one of the most fundamental machine learning models. It provides powerful predictive capability by learning complex patterns hidden in many, sometimes thousands of, predictors. It can also be used as a building block for other machine learning tasks, like unsupervised learning and reinforcement learning. Such...
The Gaussian process provides a principled and flexible approach for modeling the response surface or the latent function in many areas, including machine learning, statistics, and computer experiments. In the literature, Gaussian process models have already demonstrated their effectiveness and usefulness in a variety of applications. In this dissertation, we mainly focus...
In recent years, the social sciences have been ensnared in a crisis in which many research findings cannot be replicated (Ioannidis, 2005; Open Science Collaboration, 2015; Camerer et al., 2016; Makel & Plucker, 2014). This crisis has been attributed to a variety of problems including lack of transparency about research...
Epigenetics, the study of heritable changes in organisms not caused by mutations to DNA, holds tremendous promise for future medical applications. Although the field is still in its infancy, feature selection in statistics plays an important role in correlating epigenetic changes with diseases and various health issues. Feature selection may also be used in...
The focus of this thesis is on evaluating, designing, and applying statistical methods that elucidate molecular mechanisms by seeking to understand the pathways that contribute to disease. Chapter 1 introduces the field and motivates the work in this thesis. Chapters 2, 3, and 4 describe original work. Chapter 5 recapitulates...
This dissertation is a collection of three papers on synthesizing and translating statistical evidence in education research. Chapter 1 serves as an introduction and executive summary, and Chapters 2 through 4 contain the three substantive papers, respectively. Chapter 2 presents methods for pooling sample variances across studies to improve properties...
In this thesis we present methods for estimating network metrics via random walk sampling. More specifically, we generalize the Hansen-Hurwitz estimator and the Horvitz-Thompson estimator to estimate the shortest path length distribution (SPLD), closeness centrality ranking, and clustering coefficients of a network. These are important metrics of a network, but...
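For reference, the classical Horvitz-Thompson estimator being generalized estimates a population total from a sample S with inclusion probabilities \pi_i as

\[
\hat{T}_{\mathrm{HT}} = \sum_{i \in S} \frac{y_i}{\pi_i},
\]

and under stationary random walk sampling of an undirected graph, the visit probability of node i is proportional to its degree d_i, which yields the familiar degree-weighted mean \hat{\mu} = \big(\sum_{i \in S} y_i / d_i\big) / \big(\sum_{i \in S} 1 / d_i\big). How the thesis extends these ideas to the SPLD and the other metrics is cut off in the excerpt.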
This dissertation proposes an oracle efficient estimator in the context of a sparse linear model. Chapter 1 introduces the penalty and the estimator that optimizes a penalized least squares objective. Unlike existing methods, the penalty is once-differentiable, and hence the estimator does not engage in model selection. This...
Language models are the foundation of many natural language tasks such as machine translation, speech recognition, and dialogue systems. Modeling the probability distributions of text accurately helps capture the structures of language and extract valuable information contained in various corpora. In recent years, many advanced models have achieved state-of-the-art performance...
The logistics of policy implementation can delay when the actual change in behavior occurs, producing a shift in a time series. Using change point analysis allows the data to determine where a change in mean, or in other parameters, occurred. But when policy is implemented...
Materials science has been central to human advancement since time immemorial. There has always been curiosity around studying the processes required to extract materials, examine their structure, and ultimately tailor their properties to meet human needs. Over the last few centuries, the ability to tailor material properties was driven by...
This dissertation consists of three papers on methods for meta-analysis with few studies. These papers are concerned with proper inference from meta-analysis models that combine data from a small number of studies using fixed and random-effects models. Chapter 1 provides an introduction to meta-analysis, the motivation for this work and...
Literature screening is the process of identifying all relevant records from a pool of candidate paper records in systematic reviews, meta-analyses, and other research synthesis tasks. This process is time-consuming, expensive, and prone to human error. Screening prioritization methods attempt to help reviewers identify the most relevant records while only...
Computer simulation experiments are commonly used as an inexpensive alternative to real-world experiments to form a metamodel that approximates the input-output relationship of the real-world experiment. The metamodel can be useful for decision making and for predicting at inputs that have not yet been evaluated, since it can be evaluated...
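The excerpt does not name a metamodel family, but Gaussian process regression (kriging) is a common choice in simulation metamodeling. The following is a minimal sketch under that assumption, with a toy simulator standing in for the real experiment:

    # A minimal metamodeling sketch, assuming a Gaussian process (kriging)
    # surrogate; the dissertation's actual metamodel family is not stated
    # in the excerpt, and the simulator below is a stand-in.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    def expensive_simulation(x):
        # stand-in for an expensive simulation run
        return np.sin(3 * x) + 0.1 * x ** 2

    # a small design of simulation runs
    X_train = np.linspace(0.0, 2.0, 8).reshape(-1, 1)
    y_train = expensive_simulation(X_train).ravel()

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), normalize_y=True)
    gp.fit(X_train, y_train)

    # the fitted metamodel predicts (with uncertainty) at unevaluated inputs
    X_new = np.array([[0.7], [1.3]])
    mean, std = gp.predict(X_new, return_std=True)

The cheap-to-evaluate surrogate gp then substitutes for expensive_simulation in downstream prediction and decision-making tasks.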
The spatial autoregressive model has been widely applied in science, in areas such as economics, public finance, political science, agricultural economics, environmental studies and transportation analyses. The classical spatial autoregressive model is a linear model for describing spatial correlation. In this work, we expand the classical model to include time...
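The classical (linear) spatial autoregressive model referenced here is typically written as

\[
y = \rho W y + X \beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I),
\]

where W is a known spatial weight matrix (often row-standardized) and \rho measures the strength of spatial correlation; the time-dependent extension this work develops is cut off in the excerpt.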
This dissertation studies the small dispersion asymptotics in highly stratified models. My goal is to show that accurate inferences are possible even if s, the number of strata, is large while m, the number of observations within each stratum, is small, provided that the model "fit well" in the term...
Many methods have been proposed for estimating the number, $m_0$ (or the proportion, $\pi_0$), of the true null hypotheses for adaptively controlling a type I error rate (e.g., the false discovery rate or FDR) using a multiple test procedure. Most of these methods eliminate ``significantly'' non-null $p$-values. Then $m_0$ is...
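One well-known member of this class (shown for illustration; the excerpt does not say which estimators the dissertation analyzes) is Storey's estimator, which discards the small, significant-looking $p$-values below a threshold $\lambda$ and scales up the count of the rest:

\[
\hat{\pi}_0(\lambda) = \frac{\#\{i : p_i > \lambda\}}{(1-\lambda)\, m}, \qquad \hat{m}_0 = m \, \hat{\pi}_0(\lambda),
\]

using the fact that truly null $p$-values are uniform on $(0,1)$, so roughly $(1-\lambda)\, m_0$ of them should exceed $\lambda$.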
Small area estimation (SAE) has been one of the most active areas in survey methodology research, due to the increasing demand for small area statistics from government agencies and the private sector. But in some areas of interest, sample sizes could be very small, or even zero, in which case,...
One of the most commonly used techniques for classification problems is logistic regression. For example, logistic regression for a binary response assumes that the odds Pr(y = 1|x)/Pr(y = 0|x) = exp(a + bx). However, in reality, the pattern of the data can be so complicated that the logistic regression model often fails,...
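A minimal sketch of this binary logistic model in Python (the simulated data and parameter values are illustrative, not from the dissertation): because the log-odds are linear in x, the fitted intercept and slope recover a and b.

    # Minimal logistic regression sketch; data are simulated for illustration.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 1))
    # simulate from the model with illustrative values a = 0.5, b = 2.0
    p = 1 / (1 + np.exp(-(0.5 + 2.0 * x[:, 0])))
    y = rng.binomial(1, p)

    model = LogisticRegression().fit(x, y)
    a_hat, b_hat = model.intercept_[0], model.coef_[0, 0]
    # fitted odds: Pr(y=1|x) / Pr(y=0|x) ~= exp(a_hat + b_hat * x)

When the true decision boundary is nonlinear in x, this linear-log-odds assumption is exactly what fails, motivating the more flexible models the abstract alludes to.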
The use of cluster randomized experiments to study the effects of treatments on groups of subjects has increased in recent years. Many of these experiments lack the necessary statistical power to detect practically meaningful effects of treatment. One method for improving power in cluster randomized experiments that has been advanced...
In recent years, research has been conducted to develop Sequential, Multiple Assignment, Randomized Trial (SMART) designs. These experimental designs were created to aid in the construction of adaptive treatment strategies for individuals, particularly in medical contexts. Simultaneously, research has been done on developing the use of randomized trials to evaluate...
High-dimensional data are becoming increasingly available in various fields as data collection technology advances. Not only are we interested in knowing which variables are relevant to the response and which are not, but a simpler model with fewer predictor variables is also easier to interpret and cheaper to compute. Furthermore, a...
The last two decades have seen a surge of interest in approaches that leverage network structure in machine learning models. For many networks, not only the connections of the network but also network attributes, such as node attributes and dyadic attributes, are observed. This heterogeneity in networks raises new challenges...
RNA-Sequencing (RNA-Seq) is a powerful high-throughput tool for profiling transcriptional activity in cells. The observed read counts can be biased by various factors such that they do not accurately represent the true relative abundance of mRNA transcripts. Normalization is a critical step to ensure unbiased comparison of gene expression...