Essays on "Small" Sample Problems in "Large" Datasets

Many estimation and inference procedures rely on asymptotic approximations for quantities that are unknown to researchers. While often convenient, such approximations can be poor in practice, even when the number of observations is ostensibly large. One response is to eschew asymptotics in favor of finite-sample bounds. While remarkable progress has been made in this regard, bounds are often wide or involve unknown parameters that limit their use. This dissertation takes a different approach. Our view is that the failure of asymptotics can often be attributed to certain pathological features of the data that reduce effective sample size, so that datasets may be small for the purpose of asymptotic approximation even when they are nominally large. Our solution is to develop alternative asymptotic theories that explicitly incorporate these features, so that "small" data problems persist in the limiting approximations, which we expect to be more accurate as a result. We pursue this approach in three settings.

Chapter 1 studies the properties of linear regression on centrality measures when network data are sparse -- that is, when there are many more agents than links per agent -- and when the networks are constructed by proxy. Sparse network data contain little information, since their adjacency matrices are mostly zeroes. Conventional analyses, based on taking the number of nodes to infinity, ignore the fact that centrality measures may have no variation when networks are sparse. Instead, we study the theoretical properties of OLS under sequences of increasingly sparse networks, making three contributions: (1) We show that OLS estimators can become inconsistent under sparsity and characterize the threshold at which this occurs, with and without proxy error. This threshold depends on the centrality measure used; in particular, regression on eigenvector centrality is less robust to sparsity than regression on degree or diffusion centrality.
(2) We develop distributional theory for OLS estimators under proxy error and sparsity, finding that OLS estimators are subject to asymptotic bias even when they are consistent. Moreover, the bias can be large relative to the variance, so that bias correction is necessary for inference. (3) We propose novel bias correction and inference methods for OLS with sparse proxy networks. Simulation evidence suggests that our theory and methods perform well, particularly in settings where the usual OLS estimators and heteroskedasticity-robust $t$-tests are deficient. Finally, we demonstrate the utility of our results in an application inspired by De Weerdt and Dercon (2006), in which we study the relationship between consumption smoothing and informal insurance in Nyakatoke, Tanzania.

Chapters 2 and 3 consider inference with cluster-dependent data. When researchers are concerned about dependence between observations in their datasets, they typically group observations into independent clusters in order to facilitate inference using approximate randomization tests (ART) or tests based on the clustered-covariance estimator (CCE). Because researchers are often willing to make only minimal assumptions about the dependence structure within each cluster, cluster-dependent methods typically have an effective sample size equal to the number of clusters, which is low in many empirical settings, even if the total number of observations is large. To better understand the challenges posed by few clusters, Chapters 2 and 3 study inference with cluster-dependent data in asymptotic frameworks in which the number of clusters is finite in the limit.

Chapter 2 proposes a test for the level of clustering. CCE and ART require the cluster structure of the data to be known ex ante. However, researchers often have some choice in how to cluster their data.
As such, a researcher who has chosen to cluster their data at a finer, more disaggregated level may be unsure about that decision, especially given knowledge that observations are independent when clustered at a coarser, more aggregated level. Chapter 2 proposes a modified randomization test as a robustness check for the chosen level of clustering in a linear regression setting. Existing tests require either the number of coarse clusters or the number of fine clusters to be large; our method is designed for settings with few coarse and few fine clusters. While the method is conservative, it has competitive power in settings relevant to empirical work.

Chapter 3 (joint with Ivan A. Canay, Deborah Kim, and Azeem M. Shaikh) considers issues in the implementation of approximate randomization tests, an inference method explicitly designed for settings with few clusters. We show that ART admits an equivalent implementation based on weighted scores, and that the test and its confidence intervals are invariant to whether the test statistic is studentized. When the test involves a scalar parameter, we prove that confidence intervals formed via test inversion are convex, and we present a novel, exact algorithm for test inversion that reliably outperforms grid and bisection search. The chapter is written as a user's guide: we articulate the main requirements underlying the test, emphasize common pitfalls that researchers may encounter, and provide two empirical demonstrations based on Munyo and Rossi (2015) and Meng et al. (2015).

Finally, Chapter 4 (joint with Ahnaf Rafi) considers experiment design with the Neyman Allocation, which is used in many papers on experiment design. These papers typically assume that researchers have access to large pilot studies, which may not be realistic.
To understand the properties of the Neyman Allocation with small pilots, we study its behavior in a novel asymptotic framework for two-wave experiments in which the pilot size is held fixed even as the main-wave sample size grows. Our analysis shows that the Neyman Allocation can lead to estimates of the average treatment effect (ATE) with higher asymptotic variance than (non-adaptive) balanced randomization. That is, even with a large main wave, the reduction in asymptotic variance delivered by the Neyman Allocation depends on the size of the pilot study used to estimate it. We find that the method performs especially poorly relative to balanced randomization when the outcome variable is relatively homoskedastic with respect to treatment status or when it exhibits high kurtosis, and we provide a series of empirical examples showing that these situations arise frequently in practice. Our results therefore suggest that researchers should not use the Neyman Allocation with small pilots, especially in such instances.
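To fix ideas on Chapter 4's setting: the classical Neyman Allocation sets the treated share of the main wave proportional to the outcome standard deviation in each arm, which in a two-wave design must be estimated from the pilot. The following is a minimal sketch with hypothetical pilot data and plug-in standard deviations; the dissertation's formal framework is more involved.

```python
import numpy as np

def neyman_share(pilot_treated, pilot_control):
    """Estimated Neyman Allocation: fraction of the main wave to treat.

    The classical rule sets n_treated / n = s1 / (s1 + s0), where s1 and
    s0 are the outcome standard deviations under treatment and control.
    With a small pilot, s1 and s0 are noisy estimates, which is the
    source of the problem studied in Chapter 4.
    """
    s1 = np.std(pilot_treated, ddof=1)
    s0 = np.std(pilot_control, ddof=1)
    return s1 / (s1 + s0)

rng = np.random.default_rng(0)
# Hypothetical pilot with 10 observations per arm. The outcome is
# homoskedastic across treatment status (true optimal share is 0.5),
# a case where Chapter 4 finds the estimated Neyman Allocation can do
# worse than a balanced 50/50 design.
share = neyman_share(rng.normal(size=10), rng.normal(size=10))
```

With equal variances the estimated share fluctuates around 0.5 purely because of pilot noise; the balanced design attains 0.5 by construction without spending any pilot information.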
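For the randomization tests discussed in Chapters 2 and 3, the core mechanic can be illustrated with a sign-change randomization test on cluster-level estimates: under the null, each cluster's estimate is approximately symmetrically distributed around the hypothesized value, so flipping signs generates a valid reference distribution even with very few clusters. This is a hypothetical minimal sketch, not the weighted-score implementation developed in Chapter 3; it uses an unstudentized statistic, consistent with the invariance-to-studentization result mentioned above.

```python
import itertools
import numpy as np

def art_pvalue(cluster_estimates, beta0):
    """Sign-change randomization p-value for H0: beta = beta0.

    Enumerates all 2^q sign flips of the centered cluster-level
    estimates, which is feasible precisely because the number of
    clusters q is small. Returns the fraction of sign-flip draws
    whose statistic is at least as large as the observed one.
    """
    d = np.asarray(cluster_estimates, dtype=float) - beta0
    q = len(d)
    t_obs = abs(d.mean())
    hits, total = 0, 0
    for signs in itertools.product([-1.0, 1.0], repeat=q):
        t = abs((np.array(signs) * d).mean())
        hits += t >= t_obs
        total += 1
    return hits / total

# Hypothetical example: q = 8 cluster-level estimates under the null.
rng = np.random.default_rng(1)
p = art_pvalue(rng.normal(loc=0.0, scale=1.0, size=8), beta0=0.0)
```

Because the identity sign vector is always included, the p-value is bounded away from zero by 2^{-(q-1)}, which is one concrete way the small number of clusters limits attainable significance levels.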
