Characterization and identification of long non-coding RNAs based on feature relationship

Author:

Wang Guangyu (1,2,3), Yin Hongyan (1,2,3), Li Boyang (4), Yu Chunlei (1,2,3), Wang Fan (1,2), Xu Xingjian (1,2,3), Cao Jiabao (1,2,3), Bao Yiming (1,2), Wang Liguo (5), Abbasi Amir A (6), Bajic Vladimir B (7), Ma Lina (1,2), Zhang Zhang (1,2,3)

Affiliation:

1. CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China

2. BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China

3. University of Chinese Academy of Sciences, Beijing, China

4. Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA

5. Division of Biomedical Statistics and Informatics, Mayo Clinic College of Medicine, Rochester, MN, USA

6. National Center for Bioinformatics, Programme of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad, Pakistan

7. King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Thuwal, Kingdom of Saudi Arabia

Abstract

Motivation: The significance of long non-coding RNAs (lncRNAs) in many biological processes and diseases has attracted intense interest over the past several years. However, computational identification of lncRNAs across a wide range of species remains challenging: it requires prior knowledge of well-established sequences and annotations or species-specific training data, yet only a limited number of species have high-quality sequences and annotations.

Results: Here we first characterize lncRNAs in contrast to protein-coding RNAs based on feature relationship and find that the relationship between open reading frame (ORF) length and guanine-cytosine (GC) content diverges substantially and universally between lncRNAs and protein-coding RNAs, as observed in a broad variety of species. Based on this feature relationship, we further present LGC, a novel algorithm for identifying lncRNAs that accurately distinguishes lncRNAs from protein-coding RNAs in a cross-species manner without any prior knowledge. As validated on large-scale empirical datasets, comparative results show that LGC outperforms existing algorithms by achieving higher accuracy with well-balanced sensitivity and specificity, and that it is robustly effective (>90% accuracy) in discriminating lncRNAs from protein-coding RNAs across diverse species ranging from plants to mammals. To our knowledge, this study is the first to differentially characterize lncRNAs and protein-coding RNAs based on feature relationship and to apply that relationship to the computational identification of lncRNAs. Taken together, our study represents a significant advance in the characterization and identification of lncRNAs, and LGC thus bears broad potential utility for computational analysis of lncRNAs in a wide range of species.

Availability and implementation: The LGC web server is publicly available at http://bigd.big.ac.cn/lgc/calculator. The scripts and data can be downloaded at http://bigd.big.ac.cn/biocode/tools/BT000004.

Supplementary information: Supplementary data are available at Bioinformatics online.
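The two features named in the abstract, ORF length and GC content, can be illustrated with a minimal Python sketch. This is not the LGC algorithm itself: the abstract does not give the scoring model, so the sketch only extracts the two raw features. The function names (`gc_content`, `longest_orf`, `orf_features`) are hypothetical, and the ORF definition used here (longest ATG-to-stop stretch in the three forward frames, stop codon included, GC content measured on that ORF) is an assumption that may differ from the authors' implementation.

```python
# Illustrative feature extraction only; names and ORF definition are assumptions,
# not taken from the LGC scripts.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_orf(seq: str) -> str:
    """Longest ATG..stop ORF over the three forward reading frames.

    The returned string includes the stop codon; it is empty if no
    complete ORF is found.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA input as well
    best = ""
    for frame in range(3):
        start = None  # index of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # close this ORF and look for the next one
    return best


def orf_features(seq: str) -> tuple[int, float]:
    """Return (ORF length in nucleotides, GC content of that ORF)."""
    orf = longest_orf(seq)
    return len(orf), gc_content(orf)


if __name__ == "__main__":
    length, gc = orf_features("CCATGGCGGCCTAAGG")
    print(length, round(gc, 3))
```

Computing these two values per transcript is only the first step; how they are related and turned into a coding/non-coding decision is defined in the paper and the downloadable scripts, not in this sketch.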

Funder

Strategic Priority Research Programme of the Chinese Academy of Sciences

National Key Research and Development Programme of China

International Partnership Programme of the Chinese Academy of Sciences

National Natural Science Foundation of China

The Open Biodiversity and Health Big Data Initiative of IUBS

The 13th Five-year Informatization Plan of Chinese Academy of Sciences

The King Abdullah University of Science and Technology

KAUST Base Research Funds

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability

