From contigs towards chromosomes: automatic improvement of long read assemblies (ILRA)

Author:

Ruiz José Luis (1), Reimering Susanne (2), Escobar-Prieto Juan David (3), Brancucci Nicolas M B (4, 5, 6), Echeverry Diego F (3, 7), Abdi Abdirahman I (8), Marti Matthias (4), Gómez-Díaz Elena (1), Otto Thomas D (4)

Affiliation:

1. Consejo Superior de Investigaciones Científicas, Instituto de Parasitología y Biomedicina López-Neyra (IPBLN), 18016 Granada, Spain

2. Helmholtz Centre for Infection Research, Department for Computational Biology of Infection Research, Braunschweig, Germany

3. Centro Internacional de Entrenamiento e Investigaciones Médicas (CIDEIM), Cali, Colombia

4. University of Glasgow, School of Infection & Immunity, MVLS, Glasgow, UK

5. Swiss Tropical and Public Health Institute, Department of Medical Parasitology and Infection Biology, 4123 Allschwil, Switzerland

6. University of Basel, 4001 Basel, Switzerland

7. Universidad del Valle, Departamento de Microbiología, Facultad de Salud, Cali, Colombia

8. KEMRI-Wellcome Trust Research Programme, CGMRC, Kilifi, Kenya

Abstract

Recent advances in long read technologies not only enable large consortia to aim to sequence all eukaryotes on Earth, but they also allow individual laboratories to sequence their species of interest with relatively low investment. Long read technologies embody the promise of overcoming scaffolding problems associated with repeats and low complexity sequences, but the number of contigs often far exceeds the number of chromosomes, and the contigs may contain many insertion and deletion errors around homopolymer tracts. To overcome these issues, we have implemented the ILRA pipeline to correct long read-based assemblies. Contigs are first reordered, renamed, merged, circularized, or filtered if erroneous or contaminated. Illumina short reads are subsequently used to correct homopolymer errors. We successfully tested our approach by improving the genome sequences of Homo sapiens, Trypanosoma brucei, and Leptosphaeria spp., and by generating four novel Plasmodium falciparum assemblies from field samples. We found that correcting homopolymer tracts reduced the number of genes incorrectly annotated as pseudogenes, but an iterative approach seems to be required to correct more sequencing errors. In summary, we describe and benchmark the performance of our new tool, which improved the quality of novel long read assemblies up to 1 Gbp. The pipeline is available on GitHub: https://github.com/ThomasDOtto/ILRA.

Funder

Severo Ochoa Fellowship

La Caixa Foundation—Health Research Program

Spanish Ministry of Science and Innovation

Wellcome Trust

Publisher

Oxford University Press (OUP)

Subject

Molecular Biology, Information Systems

Cited by 2 articles.
