NORMALIZATION OF RNA DEGRADATION IN RNA-SEQ DATA
RNA-Sequencing (RNA-Seq) is a powerful high-throughput tool for profiling transcriptional activity in cells. The observed read counts can be biased by various factors, so that they do not accurately represent the true relative abundance of mRNA transcripts. Normalization is a critical step to ensure unbiased comparison of gene expression between samples or conditions. Here we show that gene-specific heterogeneity in transcript degradation patterns across samples is a common and major source of unwanted variation, and that it may substantially bias the results of gene expression analysis. Most existing normalization approaches focus on global adjustment of systematic bias and are ineffective at correcting this gene-specific bias. We propose a novel method based on nonnegative matrix factorization with over-approximation constraints that quantifies the RNA degradation of each gene within each sample. The estimated degradation index scores are used to build a pipeline named DegNorm (short for degradation normalization) that adjusts read counts for RNA degradation heterogeneity on a gene-by-gene basis while simultaneously controlling for sequencing depth. The robust and effective performance of this method is demonstrated on an extensive set of real and simulated RNA-Seq data.
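To make the idea of an over-approximating factorization concrete, the following is a minimal sketch and not the DegNorm implementation. It rests on several assumptions that the abstract does not state: that each gene is represented by a samples-by-positions read coverage matrix, that the factorization is rank one (a per-sample scale `K` times a shared coverage envelope `E`), and that a degradation index can be read as the fraction of the fitted envelope not covered by observed reads. The function names are hypothetical, and the simple tightening heuristic below only produces a feasible envelope with `K[i] * E[j] >= F[i, j]`; it stands in for, and is not equivalent to, the constrained optimization the method solves.

```python
import numpy as np


def rank_one_envelope(F):
    """Return (K, E) with K[i] * E[j] >= F[i, j] for a nonnegative matrix F.

    Hypothetical heuristic: start from the per-position maximum, then
    shrink the scales and the envelope as far as the constraint allows.
    """
    F = np.asarray(F, dtype=float)
    # Crude starting envelope: the largest coverage seen at each position.
    E0 = F.max(axis=0)
    # Smallest per-sample scale that keeps K[i] * E0[j] >= F[i, j].
    ratio = np.divide(F, E0, out=np.zeros_like(F), where=E0 > 0)
    K = ratio.max(axis=1)
    # Smallest envelope that keeps the constraint for those scales.
    ratio = np.divide(F, K[:, None], out=np.zeros_like(F), where=K[:, None] > 0)
    E = ratio.max(axis=0)
    return K, E


def degradation_index(F, K, E):
    """Fraction of each sample's fitted envelope not covered by reads."""
    F = np.asarray(F, dtype=float)
    envelope_area = K * E.sum()
    observed_area = F.sum(axis=1)
    covered = np.divide(
        observed_area,
        envelope_area,
        out=np.ones_like(observed_area),
        where=envelope_area > 0,
    )
    return 1.0 - covered


# Toy coverage for one gene: three samples, four positions (made-up numbers).
# Samples 1 and 2 are intact at different depths; sample 3 loses coverage
# toward one end of the transcript.
coverage = np.array([
    [10, 10, 10, 10],
    [20, 20, 20, 20],
    [10, 10, 5, 0],
])
K, E = rank_one_envelope(coverage)
di = degradation_index(coverage, K, E)
print(di)  # samples 1 and 2 score 0, sample 3 scores 0.375
```

In this toy case the two intact samples are reproduced exactly by the envelope, while the third sample covers only 25 of the 40 units its fitted envelope would predict. Under the assumptions above, a per-gene, per-sample score of this kind is what a degradation-aware normalization would use to adjust read counts.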