Effect of method of deduplication on estimation of differential gene expression using RNA-seq

Author:

Klepikova Anna V. (1,2), Kasianov Artem S. (2,3), Chesnokov Mikhail S. (4), Lazarevich Natalia L. (4,5), Penin Aleksey A. (1,2,5), Logacheva Maria (1,2,6)

Affiliation:

1. Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia

2. A. N. Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, Russia

3. N. I. Vavilov Institute for General Genetics, Moscow, Russia

4. N.N. Blokhin Russian Cancer Research Center of the Ministry of Health of the Russian Federation, Moscow, Russia

5. Department of Biology, Lomonosov Moscow State University, Moscow, Russia

6. Extreme Biology Laboratory, Institute of Fundamental Medicine and Biology, Kazan Federal University, Kazan

Abstract

Background. RNA-seq is a useful tool for analysis of gene expression. However, its robustness is greatly affected by a number of artifacts. One of them is the presence of duplicated reads.

Results. To infer the influence of different methods of removal of duplicated reads on estimation of gene expression in cancer genomics, we analyzed paired samples of hepatocellular carcinoma (HCC) and non-tumor liver tissue. Four protocols of data analysis were applied to each sample: processing without deduplication, deduplication using a method implemented in samtools, and deduplication based on one or two molecular indices (MI). We also analyzed the influence of sequencing layout (single read or paired end) and read length. We found that deduplication without MI greatly affects estimated expression values; this effect is the most pronounced for highly expressed genes.

Conclusion. The use of unique molecular identifiers greatly improves accuracy of RNA-seq analysis, especially for highly expressed genes. We developed a set of scripts that enable handling of MI and their incorporation into RNA-seq analysis pipelines. Deduplication without MI affects results of differential gene expression analysis, producing a high proportion of false negative results. The absence of duplicate read removal is biased towards false positives. In those cases where using MI is not possible, we recommend using paired-end sequencing layout.
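The contrast the abstract draws between deduplication with and without MI can be illustrated with a toy sketch. This is not the authors' scripts and not the algorithm used by samtools. The `Read` record, its fields, and both functions are hypothetical names introduced only for illustration, and the position-only rule is a simplified stand-in for coordinate-based deduplication.

```python
from collections import namedtuple

# Hypothetical toy read record: gene name, mapping start, molecular index (MI).
Read = namedtuple("Read", ["gene", "start", "mi"])


def dedup_by_position(reads):
    """Keep one read per (gene, start); a simplified position-only rule."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.gene, read.start)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


def dedup_by_position_and_mi(reads):
    """Keep one read per (gene, start, MI); reads with different MIs survive."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.gene, read.start, read.mi)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


# Toy data for a highly expressed gene: many molecules share a start position.
reads = [
    Read("geneA", 100, "ACGT"),
    Read("geneA", 100, "ACGT"),  # same position and same MI: PCR duplicate
    Read("geneA", 100, "TTAG"),  # same position, different MI: distinct molecule
    Read("geneA", 100, "GGCA"),  # same position, different MI: distinct molecule
    Read("geneB", 250, "ACGT"),
]

no_dedup_count = len(reads)
position_only_count = len(dedup_by_position(reads))
with_mi_count = len(dedup_by_position_and_mi(reads))
print(no_dedup_count, position_only_count, with_mi_count)
```

In this toy data the position-only rule collapses three distinct geneA molecules into one, while the MI-aware rule removes only the true duplicate. That mirrors, in miniature, why the abstract reports the strongest effect for highly expressed genes.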

Funder

Ministry of Education and Science of Russia

Publisher

PeerJ

Subject

General Agricultural and Biological Sciences; General Biochemistry, Genetics and Molecular Biology; General Medicine; General Neuroscience

