Finding the Needle in the Haystack: Applying Data Science to Address Biological Questions
Biology is entering the exciting world of big data. Modern high-throughput experimental techniques produce large datasets that aim to capture the complex relationships found in biological systems. While these datasets contain vast amounts of useful information, the answers are often locked behind a wall of numbers. In response, the big data revolution has spawned the field of data science, composed of scientific methods, algorithms, and systems that unlock useful information by blending tools such as statistical methods, machine learning models, and data visualization pipelines. When applied to new scientific fields, these tools accelerate the discovery and understanding of novel scientific insights.

In my thesis, I apply modern data science tools to various biological datasets to investigate these complex relationships and produce actionable insights that inform future experiments. The datasets are united by the common theme of big data, and each requires data science tools to extract useful scientific results. In the first project, I investigate the signal quality of peptide arrays and call attention to the under-studied complexities of peptide behavior in mass spectrometers. In the second project, I extract useful synthesis designs for a potential nanoparticle cancer immunotherapy and expand the capabilities of the synthesis pipeline using supervised machine learning models. In the third project, I create an improved, automated methodology to systematically label and visualize RNA folding events in SHAPE-Seq datasets. I conclude this thesis by discussing an issue common to many supervised learning models: how do we interpret them? I focus on deep learning interpretation techniques as applied to medical tasks and on how current techniques fall short of emulating clinical practice.