Mutational Processes Modeling and Early Cancer Diagnosis

Annual age-adjusted breast cancer incidence rates in the United States have been static for decades. More recently, the development of massively parallel, high-throughput DNA sequencing has enabled the cataloging of somatic mutations in cancer. Mutations are non-random and occur within sequence motifs, and these motifs provide evidence from which the processes that created the mutations can be inferred. By learning the patterns of mutations in these motifs, researchers have markedly increased our understanding of the biological processes that generate somatic mutations in breast cancer. While the resulting mutational signatures provide important insights into the processes responsible for somatic mutations, gaps remain, and the etiology of several signatures is still unknown. One possible explanation is the strong assumptions made in signature studies. Reducing the number of assumptions therefore has the potential to help decipher these signatures and clarify their etiology. To date, most motif changes have been studied without information on insertions and deletions (INDELs), even though INDELs are well known to have a large effect on genotypes. Therefore, using whole-exome sequencing data, we integrate germline and somatic mutations and combine all single nucleotide variants, insertions, and deletions interactively as features in a deep learning model.

While great strides have been made in the treatment of breast cancer, successful prevention remains elusive. A better understanding of the mutational processes in breast cancer would ultimately improve prevention strategies. Current breast cancer prevention strategies fall into one of three categories: lifestyle modification, surgical intervention, and chemoprevention. These strategies have had, at best, limited success. In our study, we seek biomarkers of significantly elevated breast cancer risk that can be detected at an early stage of cancer.

Nationwide adoption of Electronic Health Records (EHRs) has given rise to a large amount of digital health data, which can be used for secondary analysis. Typical EHRs include structured data, such as diagnosis codes, vital signs, and physiologic measurements, as well as unstructured clinical narratives, such as progress notes and discharge summaries. We developed computational phenotyping methods to automatically mine and predict clinically significant or scientifically meaningful phenotypes from structured EHR data, unstructured clinical narratives, or their combination.
