Efficient k-mer based curation of raw sequence data: application in Drosophila suzukii

Author:

Gautier Mathieu

Abstract

Several studies have highlighted the presence of contaminated entries in public sequence repositories, calling for special attention to the associated metadata. Here, we propose and evaluate a fast and efficient k-mer-based approach to assess the degree of mislabeling or contamination. We applied it to high-throughput whole-genome raw sequence data for 236 Ind-Seq and 22 Pool-Seq samples of the invasive species Drosophila suzukii. We first used CLARK software to build a dictionary of species-discriminating k-mers from the curated assemblies of 29 target drosophilid species (including D. melanogaster, D. simulans, D. subpulchrella, or D. biarmipes) and 12 common drosophila pathogens and commensals (including Wolbachia). Counting the number of k-mers composing each query sample sequence that matched a discriminating k-mer from the dictionary provided a simple criterion for assignment to target species and evaluation of the entire sample. Analyses of a wide range of samples, representative of both target and other drosophilid species, demonstrated very good performance of the proposed approach, both in terms of run time and accuracy of sequence assignment. Of the 236 D. suzukii individuals, five were re-assigned to D. simulans and eleven to D. subpulchrella. Another four showed moderate to substantial microbial contamination. Similarly, among the 22 Pool-Seq samples analyzed, two from the native range were found to be contaminated with 1 and 7 D. subpulchrella individuals, respectively (out of 50), and one from Europe was found to be contaminated with 5 to 6 D. immigrans individuals (out of 100). Overall, the present analysis allowed the definition of a large curated dataset consisting of >60 population samples representative of the worldwide genetic diversity, which may be valuable for further population genetics studies on D. suzukii.
More generally, while we advocate careful sample identification and verification prior to sequencing, the proposed framework is simple and computationally efficient enough to be included as a routine post-hoc quality check prior to any data analysis and prior to data submission to public repositories.
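
The assignment criterion described in the abstract (counting query k-mers that match species-discriminating k-mers) can be illustrated with a minimal Python sketch. This is not CLARK and not the author's pipeline: the function names (`build_dictionary`, `assign_read`, `summarize_sample`), the toy sequences, the value k = 4, and the tie-handling rule are all assumptions made for illustration, and reverse-complement strands are ignored for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq (forward strand only)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_dictionary(references, k):
    """Map each k-mer seen in exactly one reference to that reference's label.

    `references` is a hypothetical {label: sequence} mapping standing in for
    the curated assemblies; k-mers shared by several labels are discarded.
    """
    owner = {}
    shared = set()
    for label, seq in references.items():
        for kmer in set(kmers(seq, k)):
            if kmer in shared:
                continue
            if kmer in owner and owner[kmer] != label:
                # Seen in another reference: no longer discriminating.
                del owner[kmer]
                shared.add(kmer)
            else:
                owner[kmer] = label
    return owner


def assign_read(read, dictionary, k):
    """Return (label, hit counts); label is None when there is no hit or a tie."""
    hits = Counter(dictionary[km] for km in kmers(read, k) if km in dictionary)
    ranked = hits.most_common(2)
    if not ranked:
        return None, hits
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None, hits  # assumption: ties are left unassigned
    return ranked[0][0], hits


def summarize_sample(reads, dictionary, k):
    """Count how many reads of a sample are assigned to each label."""
    return Counter(assign_read(r, dictionary, k)[0] or "unassigned" for r in reads)


if __name__ == "__main__":
    # Toy references; "GTTT" occurs in both and is therefore not discriminating.
    refs = {"sp_A": "ACGTACGTTT", "sp_B": "GTTTGGGCCC"}
    dictionary = build_dictionary(refs, k=4)
    sample = ["TACGTTTGG", "GGGCCC", "AAAAAA"]
    print(summarize_sample(sample, dictionary, k=4))
```

On a real sample, the per-label read counts from `summarize_sample` would be the kind of summary used to flag a mislabeled individual or a contaminated pool; the thresholds for doing so are not given in the abstract and are left out here.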

Publisher

Cold Spring Harbor Laboratory
