Modeling human language is at the frontier of machine learning and artificial intelligence. Statistical language models are probabilistic models that assign probabilities to sequences of words. For example, topic models are widely used text-mining tools that organize large collections of unstructured documents by uncovering their thematic structure. More...
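To make the idea that a statistical language model "assigns probabilities to sequences of words" concrete, here is a minimal bigram-model sketch. The toy corpus and function names are illustrative, not from the dissertation, and no smoothing is applied:

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate bigram probabilities P(w_i | w_{i-1}) from a list of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

def sequence_prob(model, sentence):
    """Probability of a sentence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for pair in zip(tokens[:-1], tokens[1:]):
        prob *= model.get(pair, 0.0)  # unseen bigrams get probability 0 here
    return prob

model = train_bigram(["the cat sat", "the dog sat"])
p = sequence_prob(model, "the cat sat")  # 1.0 * 0.5 * 1.0 * 1.0 = 0.5
```

In this toy corpus, "the" is always followed by "cat" or "dog" with equal frequency, so each full sentence receives probability 0.5; a real language model would add smoothing for unseen bigrams.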
This dissertation focuses on subgroup identification in longitudinal studies. There are two different but related topics. In chapters two and three, several longitudinal-based methods for subgroup identification with enhanced treatment effect are proposed to correct the deficiency of measuring treatment effect with a single summary statistic. In...
The advent of next-generation sequencing technologies has greatly promoted the development of metagenomics, and the analysis of compositional data has a wide range of applications in this area. Because of the constraint that species' relative abundances sum to 1, many traditional and classical statistical methods cannot...
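The sum-to-1 constraint mentioned above is commonly removed with a log-ratio transform before applying standard statistical methods. A minimal sketch of the centered log-ratio (CLR) transform, one standard option for compositional data (the dissertation may use a different transform; zero abundances would need pseudocounts in practice):

```python
import math

def clr(composition):
    """Centered log-ratio transform of a composition whose parts sum to 1.
    Requires strictly positive parts (add pseudocounts for zeros)."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

# Hypothetical relative abundances of three species
comp = [0.5, 0.3, 0.2]
z = clr(comp)  # CLR coordinates sum to zero, removing the unit-sum constraint
```

The transformed coordinates live in an unconstrained space (summing to zero rather than one), which is what allows classical multivariate methods to be applied.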
Randomization is considered the gold standard when it comes to evaluating the effectiveness of interventions, primarily due to its ability to avoid bias. However, in recent years, randomization has been heavily criticized in circumstances where subject randomization may not be ethical. In a randomized controlled trial, patients who are extremely...
A replication crisis has enveloped several scientific fields since the early 2000s (see Baker, 2016). This has given rise to improved research and reporting practices (e.g., F. S. Collins & Tabak, 2014), as well as a cottage industry of research into issues of replication and reproducibility (e.g., R. A. Klein...
The heart of computational materials science lies in providing fundamental insights and understanding of materials behavior and properties across different scales. The significance of this task is highlighted by the Materials Genome Initiative and the emergence of computational tools and frameworks such as materials by design, microstructure sensitive design, and...
Innovations are adopted by individuals and spread to other individuals. They are adopted at different rates: some are never adopted at all, some are abandoned, and some become new norms. Diffusion of innovations is an extensive, evidence-based research and practice paradigm that studies how innovations spread. This...
The focus of this thesis is on evaluating, designing, and applying statistical methods that elucidate molecular mechanisms by seeking to understand the pathways that contribute to disease. Chapter 1 introduces the field and motivates the work in this thesis. Chapters 2, 3, and 4 describe original work. Chapter 5 recapitulates...
In this thesis we present methods for estimating network metrics via random walk sampling. More specifically, we generalize the Hansen-Hurwitz estimator and the Horvitz-Thompson estimator to estimate the shortest path length distribution (SPLD), closeness centrality ranking, and clustering coefficients of a network. These are important network metrics, but...
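For readers unfamiliar with the classical Hansen-Hurwitz estimator that the thesis generalizes, here is a minimal sketch on a toy population (not the network setting of the thesis): units are drawn with replacement with unequal probabilities, and each observed value is reweighted by its draw probability to estimate the population total. In random-walk sampling the analogous weight is the walk's stationary probability, which is proportional to node degree.

```python
import random

def hansen_hurwitz(samples, values, probs):
    """Hansen-Hurwitz estimator of a population total:
    the average of y_i / p_i over draws made with replacement."""
    return sum(values[i] / probs[i] for i in samples) / len(samples)

# Toy population: 3 units with known values and unequal draw probabilities
values = [10.0, 20.0, 30.0]
probs = [0.5, 0.3, 0.2]
random.seed(0)
samples = random.choices(range(3), weights=probs, k=10_000)
est = hansen_hurwitz(samples, values, probs)  # close to the true total, 60
```

Reweighting by 1/p_i makes the estimator unbiased even though high-probability units are oversampled; the thesis's contribution is extending this idea to quantities such as the SPLD.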
This dissertation proposes an oracle-efficient estimator in the context of a sparse linear model. Chapter 1 introduces the penalty and the estimator that optimizes a penalized least squares objective. Unlike existing methods, the penalty is once differentiable, and hence the estimator does not engage in model selection. This...
Computer simulation experiments are commonly used as an inexpensive alternative to real-world experiments, forming a metamodel that approximates the input-output relationship of the real-world experiment. The metamodel can support decision making and prediction at inputs that have not yet been evaluated, since it can be evaluated...
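The metamodeling idea can be illustrated with the simplest possible surrogate: run an expensive simulator at a few design points, fit a cheap interpolating model, and predict at untried inputs. The simulator and the polynomial-interpolation metamodel below are illustrative stand-ins; actual metamodels are typically Gaussian processes or other regression models.

```python
def lagrange_predict(xs, ys, x):
    """Interpolating metamodel: predict the output at x from evaluated design points
    using Lagrange polynomial interpolation."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        w = 1.0
        for j, xj in enumerate(xs):
            if i != j:
                w *= (x - xj) / (xi - xj)
        total += yi * w
    return total

def simulator(x):
    """Stand-in for an expensive computer simulation (illustrative)."""
    return 2 * x**2 + 3 * x + 1

xs = [0.0, 1.0, 2.0]              # design points actually run through the simulator
ys = [simulator(x) for x in xs]   # expensive evaluations
pred = lagrange_predict(xs, ys, 1.5)  # cheap prediction at an untried input
```

Because the stand-in simulator is quadratic and three points determine a quadratic, the metamodel here reproduces it exactly; for a real simulator the metamodel only approximates the response, which is why design and uncertainty quantification matter.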
The spatial autoregressive model has been widely applied in science, in areas such as economics, public finance, political science, agricultural economics, environmental studies, and transportation analysis. The classical spatial autoregressive model is a linear model for describing spatial correlation. In this work, we expand the classical model to include time...
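The classical spatial autoregressive (SAR) model takes the form y = rho*W*y + X*beta + eps, with reduced form y = (I - rho*W)^{-1}(X*beta + eps). A minimal sketch of the reduced-form mean for a hypothetical two-unit system (hardcoded 2x2 inverse to stay self-contained; real applications use many units and matrix libraries):

```python
def sar_mean(rho, W, xb):
    """Mean of y in the SAR model y = rho*W*y + X*beta + eps:
    E[y] = (I - rho*W)^{-1} (X*beta), computed here for exactly two units."""
    a = 1.0 - rho * W[0][0]
    b = -rho * W[0][1]
    c = -rho * W[1][0]
    d = 1.0 - rho * W[1][1]
    det = a * d - b * c  # invert the 2x2 matrix (I - rho*W) by hand
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return [inv[0][0] * xb[0] + inv[0][1] * xb[1],
            inv[1][0] * xb[0] + inv[1][1] * xb[1]]

W = [[0.0, 1.0], [1.0, 0.0]]           # two units that neighbor each other
mean_y = sar_mean(0.4, W, [1.0, 1.0])  # spatial feedback inflates the mean above 1
```

With rho = 0.4 and X*beta = 1 at both units, each unit's mean rises to 1/(1 - rho) = 5/3: the spatial lag feeds each unit's outcome back through its neighbor.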
The ever-growing desire for accurate estimation and efficient learning necessitates efforts to quantitatively characterize model uncertainties. In this thesis, four problems pertaining to uncertainty quantification are discussed: a sequential stopping framework for constructing fixed-precision confidence regions is proposed for a class of multivariate simulation problems where variance...
The advent of sequencing technologies has generated a large amount of biological and medical data. These data, such as genetic sequencing data and experimental lab evidence, can help address critical biomedical problems. This dissertation makes contributions to three different but related applications in biomedical research. In Chapter 2, we...
The Gaussian process provides a principled and flexible approach for modeling the response surface or the latent function in many areas, including machine learning, statistics, and computer experiments. In the literature, Gaussian process models have already demonstrated their effectiveness and usefulness in a variety of applications. In this dissertation, we mainly focus...
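As a minimal illustration of Gaussian process regression for response-surface modeling, the sketch below computes the GP posterior mean k_*^T (K + sigma^2 I)^{-1} y with a squared-exponential kernel. It is restricted to two training points so the 2x2 covariance matrix can be inverted by hand with only the standard library; the kernel, hyperparameters, and data are illustrative, not from the dissertation:

```python
import math

def rbf(x1, x2, length=1.0):
    """Squared-exponential (RBF) covariance kernel."""
    return math.exp(-0.5 * ((x1 - x2) / length) ** 2)

def gp_posterior_mean(xtrain, ytrain, xstar, noise=1e-6):
    """GP regression posterior mean at xstar for exactly two training points:
    mean = k_*^T (K + noise*I)^{-1} y, with the 2x2 inverse written out."""
    k11 = rbf(xtrain[0], xtrain[0]) + noise
    k22 = rbf(xtrain[1], xtrain[1]) + noise
    k12 = rbf(xtrain[0], xtrain[1])
    det = k11 * k22 - k12 * k12
    # alpha = (K + noise*I)^{-1} y
    a0 = (k22 * ytrain[0] - k12 * ytrain[1]) / det
    a1 = (k11 * ytrain[1] - k12 * ytrain[0]) / det
    return rbf(xstar, xtrain[0]) * a0 + rbf(xstar, xtrain[1]) * a1

# With tiny noise the posterior mean nearly interpolates the training data
at_train = gp_posterior_mean([-1.0, 1.0], [0.5, 0.5], -1.0)  # ~0.5
at_mid = gp_posterior_mean([-1.0, 1.0], [0.5, 0.5], 0.0)
```

The posterior mean passes (almost) through the observed points and smoothly extends between them; the same formula also yields a posterior variance, which is what makes GPs attractive for uncertainty-aware surrogate modeling.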
Cells are often precisely organized into patterns within developing tissues. This precision must emerge from biochemical processes, within and between cells, that are inherently stochastic. I investigated the impact of stochastic gene expression on self-organized pattern formation, focusing on Senseless (Sens), a key target of Wnt and Notch signaling during...
In the short amount of time that genetic manipulation has been possible through CRISPR technology, myriad applications have been developed. Results from one of the most promising applications of this technology, pooled screens, have shown that single guide RNAs (sgRNAs), RNA sequences used to target specific regions of the genome,...
Commonsense inference is a critical capability of modern artificial intelligence (AI) systems. Machines need commonsense knowledge to perform tasks the way humans do. Learning commonsense inference from text has been a long-standing challenge in natural language processing due to reporting bias -- people do...
Modern design practices rely more and more on computer simulations due to their low cost compared with physical experiments. However, it is still an elusive task to fully unleash the advantages of the simulation models while mitigating their disadvantages for designing complex engineering systems. In simulation-based design, computer simulation models...
Sequential batches of time-evolving data for a set of persistent identifiable entities (e.g. online shopping behavior by month for a customer ID, or economic figures by year for a collection of countries) can exhibit temporal shifts in their underlying clustering structure. Methods for recovering this evolutionary clustering structure exploit natural...