Development of Informatics Methods for Analysis of Complex Gene Regulation and Transcriptome Diversity Data in Cancer

Mammalian transcriptional regulation is well known to be complex and highly context dependent. Different genetic and epigenetic features collectively orchestrate the spatial and temporal control of gene expression, producing varied transcript products across diverse cell types, tissues and biological conditions. These features include single nucleotide polymorphisms (SNPs) that function as cis- or trans-expression quantitative trait loci (eQTLs), transcription factor (TF) interaction profiles with cis-regulatory elements (CREs), methylation of CpG dinucleotide sequences, and histone modifications that affect chromatin accessibility. The different mRNA transcripts of a single gene, termed gene isoforms, are produced by alternative promoter usage or alternative splicing and often have distinct biological functions. Proper maintenance of alternative transcription and splicing is important in many biological processes, whereas perturbation of these alternative events and aberrant expression of isoforms often lead to disease. Importantly, such aberrant isoform switching has been found in genes linked to all hallmarks of cancer.

Although the advent of Next-Generation Sequencing (NGS) technologies has enabled massive profiling of gene regulation and expression landscapes across multiple -omics levels, there is still a crucial need for advanced algorithms and informatics methods that better model, analyze and translate such a wide variety of data into biological insights while taking context, complexity, and heterogeneity into consideration. Along the axis from gene regulation, through transcript-resolution expression, to protein isoform products, this thesis presents methodological improvements at each stage as attempts to address the following questions: (1) How can we take context into consideration when modeling gene regulatory sequences, especially when data are limited? (2) How can we use transcript-level signatures, integrated with various complex gene regulatory features, to characterize diseases like cancer when the data are heterogeneous across multiple -omics types, platforms and batches? (3) How can we computationally evaluate whether certain protein isoforms, as the end products of complex gene regulation, can serve as drug targets?

In the first part of this thesis, we present our development of DNABERT, a deep learning-based Natural Language Processing (NLP) model that generically deciphers different types of regulatory sequences in a context-dependent fashion. DNABERT achieves superior performance compared to many published methods and supports prediction of sequence function and specificity, motif discovery, and functional variant prioritization across multiple biological applications. We expect DNABERT to be a useful tool for understanding the widespread perturbation of gene regulation in cancer, which produces aberrantly expressed isoforms.

In the second part of the thesis, we first present our platform-independent cancer subtyping and classification pipeline (PIGExClass), which uses isoform-level data. We show that isoform-level data capture more clinically relevant ovarian cancer subtypes that help with patient stratification, and that the isoform signatures are directly translatable across platforms. Next, we extend this pipeline to the multi-modal setting (multiple platforms, batches, and -omics types) by developing a novel deep learning-based method, DeepMOIS-MC, which integrates many levels of -omics data to determine subtypes more robustly.

In the last part of the thesis, we present an informatics pipeline, built on our curated data, for identifying protein isoforms as potential cancer drug targets. We conclude that the majority of current cancer drugs are not isoform-specific, which might lead to unexpected off-target effects.

Together, this thesis focuses on developing computational methods at different stages of mammalian gene regulation and expression to shed light on the mechanism of aberrant isoform production in cancer, and on how the clinical value of these isoforms can be exploited, with the help of gene regulatory data, for patient stratification and drug discovery.
