Essays on the Sociological Analysis of Segregation and Natural Language

This dissertation contributes to the theory of segregation and to methodologies for measuring it. The first two chapters focus on the classic problem of quantifying segregation in survey data through segregation indices, which summarize the segregation of an environment with a single number, usually between 0 and 1. The last chapter turns to a newer form of data, unstructured text, and analyzes the problem of extracting stereotypical cultural schemas from it using increasingly popular word embedding models.

The first chapter shows that segregation indices calculated from samples are biased and unreliable, especially in small samples. Researchers often compute segregation indices on samples to estimate segregation in a population, so statistical inference on segregation indices is necessary; yet methods for such inference are scarce and rarely used. To address this problem, the chapter formulates two new general techniques based on non-parametric Bayesian models. The new techniques apply to any segregation index or function of segregation indices. To demonstrate their capability, the chapter tests them on the D and Theil indices and on the decomposition of the Theil index. Extensive Monte Carlo simulations compare the new Bayesian techniques with current standard practice and with the best available alternative, a bootstrap-based estimator. In all simulations, the new techniques provide more reliable inferences than previously achieved; in particular, they are markedly more accurate in small samples and in the construction of confidence intervals. We therefore recommend the new Bayesian techniques for inference, especially with smaller samples.

The second chapter analyzes the problem of comparing segregation indices.
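The small-sample bias discussed in the first chapter can be illustrated with a toy computation (a sketch with invented counts, not data or code from the dissertation): even when a population is perfectly integrated, the dissimilarity index D computed from small per-unit samples comes out strictly positive.

```python
# Toy illustration of small-sample bias in the dissimilarity index D.
# All counts are invented; this is not the dissertation's data or method.
import random

def dissimilarity(units):
    """D = 0.5 * sum_i |a_i/A - b_i/B| over units with (a_i, b_i) group counts."""
    A = sum(a for a, b in units)
    B = sum(b for a, b in units)
    return 0.5 * sum(abs(a / A - b / B) for a, b in units)

random.seed(0)

# Population: 20 units, each exactly half group A and half group B, so D = 0.
population = [(500, 500)] * 20
print(dissimilarity(population))  # 0.0: no segregation in the population

def sample_units(units, n):
    """Draw n people without replacement from each unit and count by group."""
    sampled = []
    for a, b in units:
        draw = random.sample(["A"] * a + ["B"] * b, n)
        sampled.append((draw.count("A"), draw.count("B")))
    return sampled

# Sampling only 10 people per unit: sampling noise alone yields a positive D.
estimates = [dissimilarity(sample_units(population, 10)) for _ in range(200)]
print(sum(estimates) / len(estimates))  # well above 0: upward bias
```

Because D takes the absolute value of per-unit deviations, random sampling noise can only push the estimate upward from zero, which is why the bias is worst in small samples.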
Researchers often use segregation indices to compare segregation across environments. However, differences in segregation indices between two environments are difficult to interpret, since traditional indices mix different phenomena. The chapter formalizes the problem of interpreting change in segregation and builds a new family of indices that is interpretable from this perspective. One of its members, Q, is both interpretable and strongly decomposable, like the Theil index. To formulate Q, the chapter also provides new results on margin-free indices (Charles and Grusky, 1995): it characterizes the only way to build margin-free indices and offers a new solution to the zero problem that afflicts them. As a result, the chapter also formulates the index Q*, the first strongly decomposable margin-free index.

The third chapter analyzes the use of word embedding models in the social sciences. Word embedding models represent each word in a textual corpus as a vector in a multi-dimensional space. They are increasingly popular in the social sciences for their ability to capture cultural schemas from readily available textual corpora. Sociologists have used word embedding models to study a variety of issues, from the association of obesity with gender to the evolution of the concept of social class. A growing literature in computer science and linguistics examines how words become vectors, but fewer works analyze how to extract meaning from such vectors in order to draw social scientific conclusions. The chapter focuses on the theoretical and methodological assumptions governing the latter process. It shows that previous social scientific research relies on a simple model of meaning in word vectors. It then formulates a more general model linking meaning and vectors: the "simple algebra of meaning".
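A minimal sketch of the simple model of meaning that the chapter attributes to previous research: a word's position on a cultural dimension is read off as the cosine similarity between its vector and the difference of two pole vectors. The four-dimensional vectors below are invented for illustration only; real analyses use pretrained embeddings with hundreds of dimensions.

```python
# Hand-made toy "embeddings" (assumptions for illustration, not real vectors).
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

vectors = {
    "rich":     np.array([ 1.0, 0.2, 0.1, 0.0]),
    "poor":     np.array([-1.0, 0.1, 0.2, 0.0]),
    "yacht":    np.array([ 0.8, 0.3, 0.0, 0.1]),
    "eviction": np.array([-0.7, 0.2, 0.1, 0.0]),
}

# A "class" dimension as the difference between two pole words.
class_axis = vectors["rich"] - vectors["poor"]

for word in ("yacht", "eviction"):
    print(word, round(cosine(vectors[word], class_axis), 3))
# "yacht" projects toward the rich pole, "eviction" toward the poor pole.
```

In practice the pole vectors are often averaged over several antonym pairs to reduce noise; the single pair here keeps the sketch minimal.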
The simple algebra of meaning subsumes previous methodologies and paves the way for methodological innovation in the social scientific use of word embedding models. Finally, the chapter draws on the new model to expand the current uses of word embedding models, showing how to (1) accommodate non-binary oppositions, (2) analyze entire documents rather than single words, (3) consider more than one concept at a time, and (4) decompose the meaning of documents into a function of the meanings of their words. As an example, the chapter tests the new methodologies on a corpus of 30,228 abstracts about climate change and estimates the Lovecraftian aura of words from publicly available word embeddings.
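Point (4) above rests on a linear-algebra fact that can be checked directly: if a document is represented as the average of its word vectors, its dot-product score on a concept dimension equals the average of its words' individual scores. The random vectors below are toy data, not the chapter's corpus or its exact formulation.

```python
# Sketch: document-level meaning decomposes into word-level meanings
# because dot products are linear. All vectors are random toy data.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
vocab = {w: rng.normal(size=dim) for w in ["storm", "flood", "policy", "carbon"]}
axis = rng.normal(size=dim)  # a toy concept dimension

doc = ["storm", "flood", "policy", "carbon"]
doc_vec = np.mean([vocab[w] for w in doc], axis=0)

# Document score via the averaged document vector...
doc_score = doc_vec @ axis
# ...equals the average of the word-level scores.
word_scores = [vocab[w] @ axis for w in doc]
print(np.isclose(doc_score, np.mean(word_scores)))  # True
```

The same decomposition fails for cosine similarity, whose normalization is non-linear, which is one reason the choice of projection matters when moving from words to documents.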
