Statistical Methods for Assessing Replication: A Meta-Analytic Framework


A replication crisis has enveloped several scientific fields since the early 2000s (see Baker, 2016). This has given rise to improved research and reporting practices (e.g., F. S. Collins & Tabak, 2014), as well as a cottage industry of research into issues of replication and reproducibility (e.g., R. A. Klein et al., 2014). One strand of this research involves independent attempts to replicate scientific findings, ranging from one-off replication attempts to efforts in which several laboratories conduct the same experiment simultaneously. On its face, analyzing replication studies would seem simple: just check whether the replicate studies get the same results. However, published analyses of this type of meta-research reveal an ambiguity about what we mean when we say that studies get the "same results." Much of this research is conducted under the assumption that "no single indicator sufficiently describes replication success" (Open Science Collaboration, 2015). As a result, several different analysis methods, implying somewhat conflicting definitions of replication, have been applied to the same sets of findings, and many of these methods (such as comparing p-values) have limited statistical justification. Further, because analysis methods are not settled, meta-research programs do not appear to have been designed to ensure sufficiently sensitive analyses.

Using meta-analysis as a framework, this dissertation seeks to clarify subjective notions about replication and to propose relevant analyses. In meta-analysis, a study's results are interpreted in terms of an underlying effect parameter. These parameters may vary across replication studies because of differences in sample composition or experimental context. Successful replications would presumably minimize these differences and hence have similar (if not identical) effect parameters. Following this general approach, this dissertation examines methods to design and analyze replication studies.
It describes ways to quantify heterogeneity among effect parameters in replication studies and to explore the impact of known differences in how replicate studies are carried out; these methods are demonstrated in a re-analysis of existing data from meta-research programs in psychology. It then outlines a class of null-hypothesis tests about replication, examines their large-sample properties, and proposes corrections appropriate for smaller studies. Finally, it describes methods for planning optimal ensembles of replicate studies to ensure that they can support sensitive analyses at reasonable cost.
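As a minimal illustration of the kind of heterogeneity quantification described above (a standard random-effects approach, not necessarily the dissertation's exact method), Cochran's Q statistic and the DerSimonian-Laird moment estimator of the between-study variance tau-squared can be computed from each replicate study's effect estimate and within-study variance:

```python
def dersimonian_laird(effects, variances):
    """Quantify heterogeneity among effect parameters in replicate studies.

    Returns Cochran's Q statistic and the DerSimonian-Laird moment
    estimate of the between-study variance tau^2, given per-study
    effect estimates and their within-study sampling variances.
    """
    w = [1.0 / v for v in variances]                       # inverse-variance weights
    mu_fe = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)  # fixed-effect mean
    q = sum(wi * (yi - mu_fe) ** 2 for wi, yi in zip(w, effects))
    k = len(effects)
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                     # truncate at zero
    return q, tau2
```

Under exact replication (identical effect parameters), Q is expected to be near its degrees of freedom (k - 1) and the tau-squared estimate near zero; larger values signal heterogeneity among the replicate studies' effect parameters.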
