Computational Approaches for Enhanced Metabolomics Annotation and DNA-based Biorecording and Data Storage

Breakthroughs in large-scale biological data collection have resulted in a wealth of -omics (genomics, metabolomics, etc.) datasets in the literature. However, the development of computational techniques suited to their analysis has lagged behind, even though such techniques are crucial for fully extracting the rich information these datasets contain. This dissertation describes methods I have developed and successfully applied to help address this need. It comprises one review paper, two major research projects of two papers each, and suggestions for future work. First, I review the biotechnological opportunities and challenges associated with using metabolic kinetic modeling to quantitatively investigate large-scale biological systems, and I make recommendations on how best to use and improve kinetic modeling to answer important questions about these systems. Next, I turn to my first major project, where I applied a reaction ruleset derived from known biochemical reactions to generate publicly available, biologically reasonable candidate compound sets for metabolomics annotation workflows. In the first half of this project, I showed that these candidate sets outperformed candidate sets previously constructed using a similar method, as well as PubChem and KEGG, two databases commonly used for metabolomics annotation. In the second half of this project, I developed a more organism-specific version of this workflow that filters candidate sets down even further using a machine learning approach that considers both organismal and experimental context. I showed that this workflow performs well, exhibiting high recall and precision, and I found that metabolic modeling, which is currently underutilized by the metabolomics community, is a highly useful tool for selecting high-confidence hits.
In an effort to discover unknown biological compounds, I applied this workflow to a literature metabolomics dataset for Acinetobacter baylyi sp. ADP1, a soil microbe. The workflow filtered 4,697 predicted candidate compounds down to just 5 high-confidence compounds, 2 of which are not currently known to exist in biological systems. Experimental validation of these 5 compounds is in progress at the time of this writing. Next, I describe my second major project, where I worked as the lead computationalist on a team developing enzymatic DNA-based encoding systems for biorecording and data storage applications. In the first half, I describe TURTLES, an in vitro system designed to enzymatically record changes in metal ion concentration over time into DNA. This system expresses terminal deoxynucleotidyl transferase (TdT), a DNA polymerase that synthesizes single-stranded DNA. Because TdT kinetics are influenced by metal binding, changes in metal ion concentration are propagated into changes in the base composition of TdT-synthesized DNA. Such a system shows potential for high-throughput, high-resolution temporal recording of biological signals such as neural firing (via Ca2+ signaling) in the brain. I developed the pipeline and mathematical framework used to decode changes in metal ion concentration over time from DNA sequencing data. We successfully recorded and decoded changes in biologically significant metal ions such as Mg2+, Co2+, Zn2+, and Ca2+ with a temporal resolution of minutes, surpassing state-of-the-art techniques, which report temporal resolutions on the order of hours. In the second half of this work, I developed the computational methods for a modified version of the TURTLES system that stores user-defined data in genomic DNA in vivo. Specifically, we encoded the message “HELLO WORLD” into the genomes of mammalian cells.
We successfully decoded all but one character, showing that information can be encoded into, and decoded from, the base distributions of DNA in cellular contexts with high accuracy. These two proof-of-concept studies help pave the way for DNA-based recording and storage systems to meet real-world needs such as non-invasive in situ recording of biological signals and long-term archival data storage. Computational approaches designed in tandem with experimental techniques that produce massive datasets will be essential for robust and informative analyses. This is particularly important because the capacity to generate biological data will only grow. Computational methods such as those described here will enable and accelerate the discovery and identification of novel biological compounds, as well as the development of engineered biological systems for real-world applications.
