Combining Natural Language Processing and Metabarcoding to Reveal Pathogen-Environment Associations

Author:

Molik David C., Tomlinson DeAndre, Davitt Shane, Morgan Eric L., Roche Benjamin, Meyers Natalie, Pfrender Michael E.

Abstract

Cryptococcus neoformans is responsible for life-threatening infections that primarily affect immunocompromised individuals and has an estimated worldwide burden of 220,000 new cases each year, with 180,000 resulting deaths, mostly in sub-Saharan Africa. Surprisingly, little is known about the ecological niches occupied by C. neoformans in nature. To expand our understanding of the distribution and ecological associations of this pathogen, we implement a Natural Language Processing approach to better describe the niche of C. neoformans. We use a Latent Dirichlet Allocation model to de novo topic model sets of metagenetic research articles, written about varied subjects, which either explicitly mention, inadvertently find, or fail to find C. neoformans. These articles are all linked to NCBI Sequence Read Archive datasets of 18S ribosomal RNA and/or Internal Transcribed Spacer gene regions. The number of topics was determined based on the model coherence score, and articles were assigned to the created topics via a Machine Learning approach with a Random Forest algorithm. Our analysis provides support for a previously suggested linkage between C. neoformans and soils associated with decomposing wood. Our approach (searching single-locus metagenetic data, gathering papers connected to the datasets, determining the topics and their number de novo, and assigning articles to the topics) illustrates how such an analysis pipeline can harness large-scale datasets that are published and available but not necessarily fully analyzed, or whose metadata is not harmonized with other studies. Our approach can be applied to a variety of systems to assert potential evidence of environmental associations.

Author Summary

Our finding that C. neoformans is associated with decomposing wood is reinforced by the general literature on C. neoformans and its close congeneric relatives and warrants further investigation.
This work demonstrates the potential utility of pairing Natural Language Processing (NLP) with single-locus metagenetic data for the study of Neglected Tropical Diseases. We present a novel method to study the ecological niches of rare pathogens that leverages the immense amount of data available to researchers in the NCBI Sequence Read Archive (SRA), combined with a text-mining analysis based on Natural Language Processing. We demonstrate that text processing, noun identification, and verb identification can play an important role in analyzing a large corpus of documents together with metagenetic data. Forging this connection requires access to all of the available ecological 18S ribosomal RNA and Internal Transcribed Spacer (ITS) NCBI SRA datasets. These datasets use metabarcoding to query taxonomic diversity in eukaryotic organisms, and in the case of the Internal Transcribed Spacer, they specifically target Fungi. The presence of specific species is inferred when diagnostic 18S or ITS gene region sequences are found in the SRA data. We searched for C. neoformans in all 18S and ITS datasets available and gathered all associated journal articles that either cite the SRA data accessions or are cited in the SRA data accessions.

Published metagenetic data often have associated metadata, including latitude and longitude, temperature, and other physical characteristics describing the conditions in which the metagenetic sample was collected. These metadata are not always presented in consistent formats, so harmonizing study methods may be needed to appropriately compare metagenetic data, as is commonly required in meta-analysis studies. We present an analysis which takes as input articles associated with SRA datasets that were found to contain evidence of C. neoformans. We apply NLP methods to this corpus of articles to describe the niche of C. neoformans. Our results reinforce the current understanding of C. neoformans’s niche, indicating the pertinence of employing an NLP analysis to identify the niche of an organism. This approach could further the description of virtually any other organism that routinely appears in metagenetic surveys, especially pathogens, whose ecological niches are unknown or poorly understood.

Optional Striking Image

Cryptococcus neoformans cells budding. Image provided courtesy of Felipe H. Santiago-Tirado, colored by Kristina Davis, CC-BY 4.0
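
The abstract states that the number of topics was chosen from the model coherence score, but it does not say which coherence measure was used. As a rough illustration only, the sketch below assumes the UMass measure and picks, among candidate topic counts, the one with the highest mean coherence. The toy corpus, the candidate topic word lists (stand-ins for the top words of fitted Latent Dirichlet Allocation models), and all function names are hypothetical and are not taken from the paper.

```python
import math

def umass_coherence(top_words, docs):
    """UMass coherence of one topic (assumed measure, not confirmed by the paper).

    Sums log((D(w_i, w_j) + 1) / D(w_j)) over ordered pairs of top words,
    where D counts the documents containing the given word(s).
    """
    doc_sets = [set(d) for d in docs]
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            d_j = sum(1 for s in doc_sets if w_j in s)
            if d_j == 0:
                continue  # skip words that never occur in the corpus
            d_ij = sum(1 for s in doc_sets if w_i in s and w_j in s)
            score += math.log((d_ij + 1) / d_j)
    return score

def mean_coherence(topics, docs):
    """Average UMass coherence over all topics of one candidate model."""
    return sum(umass_coherence(t, docs) for t in topics) / len(topics)

def pick_num_topics(candidates, docs):
    """Return the candidate topic count whose topics are most coherent on average."""
    return max(candidates, key=lambda k: mean_coherence(candidates[k], docs))

# Hypothetical tokenized article texts (invented for illustration).
docs = [
    ["soil", "wood", "decay", "fungi"],
    ["soil", "wood", "fungi", "forest"],
    ["ocean", "plankton", "salinity", "water"],
    ["ocean", "water", "plankton", "depth"],
]

# Hypothetical top words per topic for models fitted with 1 and 2 topics.
candidates = {
    1: [["soil", "ocean", "wood", "water"]],
    2: [["soil", "wood", "fungi"], ["ocean", "water", "plankton"]],
}

best_k = pick_num_topics(candidates, docs)
print("selected number of topics:", best_k)
```

In a real analysis the candidate topics would come from fitting a topic model at each candidate count, and the assignment of articles to topics (done in the paper with a Random Forest) would follow as a separate step that this sketch does not cover.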

Publisher

Cold Spring Harbor Laboratory
