In this dissertation, we aim to develop a theoretical understanding of foundation models and reinforcement learning. We delve into a comprehensive analysis of specific aspects within these domains. The focal points of our study are as follows: • Generative Adversarial Imitation Learning (GAIL) with Neural Networks: GAIL is poised to...
Machine learning and deep learning have proven successful across various scientific fields, such as computer vision, natural language processing, and recommendation systems. As models become more complex, with more parameters and more intricate architectures, they can achieve higher prediction accuracy when trained on larger datasets. However, despite the great prediction...
In the Maximum-a-Posteriori (MAP) Inference problem, for any given probability distribution, the goal is to find the point in the support of that distribution with the highest probability. Potts models and Determinantal Point Processes (DPPs) are probabilistic models that were introduced in the context of statistical physics several decades ago....
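The MAP problem described above can be made concrete for a DPP, where the probability of a subset S is proportional to the principal minor det(L_S) of a kernel matrix L. A minimal brute-force sketch with a hypothetical 3x3 kernel (illustrative only; the general DPP MAP problem is NP-hard, so enumeration is feasible only for tiny ground sets):

```python
import itertools
import numpy as np

# Hypothetical 3x3 positive semidefinite DPP kernel (not from any real data).
L = np.array([[2.0, 0.3, 0.1],
              [0.3, 2.0, 0.2],
              [0.1, 0.2, 2.0]])

# For a DPP, P(S) is proportional to det(L_S), the principal minor of L
# indexed by S.  Brute-force MAP: enumerate all subsets and keep the one
# with the largest minor.  The empty set has det 1 by convention.
best_set, best_det = (), 1.0
for r in range(1, L.shape[0] + 1):
    for S in itertools.combinations(range(L.shape[0]), r):
        d = np.linalg.det(L[np.ix_(S, S)])
        if d > best_det:
            best_set, best_det = S, d
print(best_set)  # -> (0, 1, 2) for this particular kernel
```

With diagonal entries above 1 and weak off-diagonal similarity, the full ground set maximizes the minor here; stronger similarities would push the MAP set toward smaller, more diverse subsets.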
With the advancement of high-throughput sequencing technology, it has become much easier to extract gene expression data and to discover gene-disease associations more efficiently. Longitudinal gene expression data offer more insight into expression patterns for distinct patient groups compared to cross-sectional data. For instance, patients diagnosed with subclinical acute rejections...
Deduplication, also referred to as "entity resolution", is a common and crucial pre-processing step in the construction of social networks. Traditional deduplication methods compare the attributes (such as name and age) of potential matching pairs to estimate a match probability for each pair. Recently, research has used clustering techniques for...
Literature screening is the process of identifying all relevant records from a pool of candidate paper records in systematic review, meta-analysis, and other research synthesis tasks. This process is time-consuming, expensive, and prone to human error. Screening prioritization methods attempt to help reviewers identify the most relevant records while only...
Seasonal malaria chemoprevention (SMC) was first recommended by the World Health Organization (WHO) in 2012 to prevent uncomplicated malaria in children and began implementation in Burkina Faso in 2014 under programmatic campaigns. Systematic assessment of the impact of national SMC campaigns requires data with weekly or monthly temporal resolution over...
This thesis develops novel methods for generating space-filling designs inside a design space and subsampling from a data set. It incorporates materials from two papers by the author: Shang and Apley 2021; Shang, Apley, and Mehrotra 2022a. Chapter 1 discusses space-filling designs of computer experiments, which is published as Shang and Apley...
For stochastic simulation optimization in a modern computing era, we introduce a new parallel framework for solving very large-scale problems using a ranking & selection (R&S) approach that simulates all systems or feasible solutions to provide a global statistical guarantee. We propose a parallel adaptive survivor selection (PASS) framework that...
With the rapid growth of demand for data center services, the energy and water use of data centers has become a critical concern in the contexts of energy use, climate change, and freshwater conservation. Therefore, understanding, quantifying, and optimizing the use of energy and water resources in data centers has...
Sequential change-point detection for time series enables us to sequentially check the hypothesis that the model still holds as more and more data are observed. It is widely used in data monitoring in practice. In this work, we propose two models, the Binomial AR(1) model and the Generalized Beta AR(p) model, for modeling binomial...
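A binomial AR(1) process in the McKenzie-style thinning formulation can be simulated in a few lines; the parameters below are illustrative, not taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_binomial_ar1(n, alpha, beta, T):
    """Simulate a binomial AR(1) path (thinning-based sketch).

    X_t = alpha o X_{t-1} + beta o (n - X_{t-1}), where 'o' denotes
    binomial thinning: alpha o X ~ Binomial(X, alpha).  This keeps
    every X_t inside {0, ..., n}.
    """
    pi = beta / (1.0 - alpha + beta)  # stationary success probability
    x = rng.binomial(n, pi)           # start from the stationary law
    path = [x]
    for _ in range(T - 1):
        x = rng.binomial(x, alpha) + rng.binomial(n - x, beta)
        path.append(x)
    return np.array(path)

path = simulate_binomial_ar1(n=20, alpha=0.6, beta=0.2, T=500)
print(path.min(), path.max())  # every value stays within {0, ..., n}
```

The stationary mean is n * beta / (1 - alpha + beta), so roughly 6.7 for these parameters; the thinning construction guarantees a valid count-valued series, which is the appeal of this model class for binomial data.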
This dissertation contributes to the theory of segregation and methodologies to measure it. The first two chapters focus on the classical problem of quantifying segregation in traditional survey data through segregation indices. Segregation indices describe the segregation of an environment with one number – usually from 0 to 1. The...
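One canonical index of this kind is the dissimilarity index, D = 0.5 * sum_i |a_i/A - b_i/B| over areal units i, which lies in [0, 1]. A minimal sketch with made-up counts (not the dissertation's data or its proposed index):

```python
import numpy as np

# Hypothetical counts of two groups across five areal units.
group_a = np.array([40, 10, 30, 5, 15], dtype=float)
group_b = np.array([5, 35, 10, 30, 20], dtype=float)

# Dissimilarity index: D = 0.5 * sum_i |a_i/A - b_i/B|.
# D = 0 means the two groups have identical spatial distributions;
# D = 1 means complete segregation.
D = 0.5 * np.abs(group_a / group_a.sum() - group_b / group_b.sum()).sum()
print(round(D, 3))  # -> 0.55
```

D is interpretable as the fraction of either group that would have to relocate for the two distributions to match.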
In recent years, the social sciences have been ensnared in a crisis in which many research findings cannot be replicated (Ioannidis, 2005; Open Science Collaboration, 2015; Camerer et al., 2016; Makel & Plucker, 2014). This crisis has been attributed to a variety of problems including lack of transparency about research...
Language models are the foundation of many natural language tasks such as machine translation, speech recognition, and dialogue systems. Modeling the probability distributions of text accurately helps capture the structures of language and extract valuable information contained in various corpora. In recent years, many advanced models have achieved state-of-the-art performance...
The logistics of policy implementation can delay when the actual change in behavior occurs, producing a shift in a time series. Change point analysis allows the data to determine where a change in mean, or other parameters, occurred. But when policy is implemented...
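For a single shift in mean, a standard change point estimate picks the split that minimizes the within-segment sum of squares; a minimal sketch on simulated data (illustrative only, not the dissertation's method or data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated series with a mean shift of 2 standard deviations at t = 60.
x = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(2.0, 1.0, 40)])

def single_changepoint(x):
    """Least-squares estimate of a single change point in the mean:
    choose the split minimizing the within-segment sum of squares."""
    best_tau, best_cost = None, np.inf
    for tau in range(1, len(x)):
        left, right = x[:tau], x[tau:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau

tau_hat = single_changepoint(x)
print(tau_hat)  # expected close to the true change point at 60
```

With a shift this large, the estimate lands within a few observations of the true change point; smaller shifts or delayed implementation (as in the abstract) widen that uncertainty.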
Materials science has been central to human advancement since time immemorial. There has always been curiosity around studying the processes required to extract materials, examine their structure, and ultimately tailor their properties to meet human needs. Over the last few centuries, the ability to tailor material properties was driven by...
This dissertation is a collection of three papers on synthesizing and translating statistical evidence in education research. Chapter 1 serves as an introduction and executive summary, and Chapters 2-4 contain the three substantive papers respectively. Chapter 2 presents methods for pooling sample variances across studies to improve properties...
This dissertation consists of three papers on methods for meta-analysis with few studies. These papers are concerned with proper inference from meta-analysis models that combine data from a small number of studies using fixed and random-effects models. Chapter 1 provides an introduction to meta-analysis, the motivation for this work and...
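The fixed-effect model mentioned above combines study-level estimates by inverse-variance weighting; a minimal sketch with hypothetical effect estimates (not data from these papers):

```python
import numpy as np

# Hypothetical effect estimates and within-study variances from k = 4 studies.
effects = np.array([0.30, 0.10, 0.25, 0.40])
variances = np.array([0.04, 0.09, 0.05, 0.16])

# Fixed-effect (common-effect) meta-analysis: weight each study by the
# inverse of its variance; the pooled variance is 1 / sum of weights.
w = 1.0 / variances
pooled = (w * effects).sum() / w.sum()
pooled_se = np.sqrt(1.0 / w.sum())
print(round(pooled, 3), round(pooled_se, 3))  # -> 0.258 0.127
```

With only a handful of studies, as in this dissertation's setting, the normal approximation behind this standard error can be poor, which is precisely the inference problem the papers address.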
Epigenetics, the study of heritable changes in organisms not caused by mutations to DNA, holds tremendous promise for future medical applications. Although the field is still in its infancy, feature selection in statistics plays an important role in correlating epigenetic changes with diseases and various health issues. Feature selection may also be used in...