A Machine-Learning-Based Approach to Prediction of Biogeographic Ancestry within Europe

Author:

Kloska Anna (1, 2), Giełczyk Agata (3), Grzybowski Tomasz (1), Płoski Rafał (4), Kloska Sylwester M. (1, 2), Marciniak Tomasz (3), Pałczyński Krzysztof (3), Rogalla-Ładniak Urszula (1), Malyarchuk Boris A. (5), Derenko Miroslava V. (5), Kovačević-Grujičić Nataša (6), Stevanović Milena (6, 7, 8), Drakulić Danijela (6), Davidović Slobodan (9), Spólnicka Magdalena (10), Zubańska Magdalena (11), Woźniak Marcin (1)

Affiliation:

1. Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland

2. Faculty of Medical Sciences, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland

3. Faculty of Telecommunications, Computer Science and Electrical Engineering, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland

4. Department of Medical Genetics, Warsaw Medical University, 02106 Warsaw, Poland

5. Institute of Biological Problems of the North, Russian Academy of Sciences, 685000 Magadan, Russia

6. Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, 11042 Belgrade, Serbia

7. Faculty of Biology, University of Belgrade, 11000 Belgrade, Serbia

8. Serbian Academy of Sciences and Arts, 11000 Belgrade, Serbia

9. Institute for Biological Research “Siniša Stanković”, National Institute of Republic of Serbia, University of Belgrade, 11060 Belgrade, Serbia

10. Center of Forensic Sciences, University of Warsaw, 00927 Warsaw, Poland

11. Faculty of Law and Administration, Department of Criminology and Forensic Sciences, University of Warmia and Mazury, 10726 Olsztyn, Poland

Abstract

Data obtained with massively parallel sequencing (MPS) can be valuable in population genetics studies. In particular, such data have the potential to distinguish samples from different populations, especially samples from adjacent populations of common origin. Machine learning (ML) techniques seem to be especially well suited for analyzing large datasets obtained using MPS. The Slavic populations constitute about a third of the population of Europe and inhabit a large area of the continent, while being relatively closely related in population genetics terms. In this proof-of-concept study, various ML techniques were used to classify DNA samples from Slavic and non-Slavic individuals. The primary objective was to evaluate empirically whether the genetic provenance of genetically similar individuals of Slavic descent can be discerned, with the overarching goal of classifying DNA samples from representatives of different Slavic populations. Raw sequencing data were pre-processed to obtain a 1200-character-long binary vector. Three classifiers were used: Random Forest, Support Vector Machine (SVM), and XGBoost. The most promising results were obtained using SVM with a linear kernel, with 99.9% accuracy and F1-scores of 0.9846–1.000 for all classes.
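
The abstract reports results as overall accuracy plus one F1-score per class. As a minimal sketch of how such figures are typically computed from a classifier's predictions, the Python example below uses only the standard library. The labels pop_A, pop_B and pop_C and the toy predictions are hypothetical placeholders, not the study's data, and the functions are illustrative rather than the authors' evaluation code.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_f1(y_true, y_pred):
    """One-vs-rest F1 per class: F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # class p was predicted wrongly
            fn[t] += 1   # class t was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Hypothetical toy example: three placeholder classes, one misclassified sample.
y_true = ["pop_A", "pop_A", "pop_A", "pop_B", "pop_B", "pop_C", "pop_C", "pop_C"]
y_pred = ["pop_A", "pop_A", "pop_B", "pop_B", "pop_B", "pop_C", "pop_C", "pop_C"]

acc = accuracy(y_true, y_pred)
f1 = per_class_f1(y_true, y_pred)

print(f"accuracy = {acc:.3f}")
for label, score in f1.items():
    print(f"F1[{label}] = {score:.3f}")
```

In this toy case a single error lowers the F1-score of both the true class and the wrongly predicted class, which is why a range of per-class F1-scores can accompany a single accuracy figure.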

Funder

National Centre for Research and Development

Ministry of Science, Technological Development and Innovation of the Republic of Serbia

Publisher

MDPI AG

Subject

Inorganic Chemistry, Organic Chemistry, Physical and Theoretical Chemistry, Computer Science Applications, Spectroscopy, Molecular Biology, General Medicine, Catalysis
