Abstract
Cryptococcus neoformans is responsible for life-threatening infections that primarily affect immunocompromised individuals. Its estimated worldwide burden is 220,000 new cases each year, with 180,000 resulting deaths, mostly in sub-Saharan Africa. Surprisingly, little is known about the ecological niches occupied by C. neoformans in nature. To expand our understanding of the distribution and ecological associations of this pathogen, we implement a Natural Language Processing approach to better describe its niche. We use a Latent Dirichlet Allocation model to perform de novo topic modeling on sets of metagenetic research articles, written about varied subjects, which either explicitly mention, inadvertently find, or fail to find C. neoformans. These articles are all linked to NCBI Sequence Read Archive datasets of 18S ribosomal RNA and/or Internal Transcribed Spacer gene regions. The number of topics was determined from the model coherence score, and articles were assigned to the resulting topics via a Machine Learning approach with a Random Forest algorithm. Our analysis provides support for a previously suggested linkage between C. neoformans and soils associated with decomposing wood. Our approach combines a search of single-locus metagenetic data, collection of the papers connected to those datasets, de novo determination of the topics and of their number, and assignment of articles to the topics. It illustrates how such an analysis pipeline can harness large-scale datasets that are published and available but not necessarily fully analyzed, or whose metadata are not harmonized with other studies. Our approach can be applied to a variety of systems to assess potential evidence of environmental associations.
Author Summary
Our finding that C. neoformans is associated with decomposing wood is reinforced by the general literature on C. neoformans and its close congeneric relatives and warrants further investigation.
This work demonstrates the potential utility of pairing Natural Language Processing (NLP) with single-locus metagenetic data for the study of Neglected Tropical Diseases. We present a novel method to study the ecological niches of rare pathogens that leverages the immense amount of data available to researchers in the NCBI Sequence Read Archive (SRA), combined with a text-mining analysis based on NLP. We demonstrate that text processing, noun identification, and verb identification can play an important role in analyzing a large corpus of documents together with metagenetic data. Forging this connection requires access to all of the available ecological 18S ribosomal RNA and Internal Transcribed Spacer (ITS) NCBI SRA datasets. These datasets use metabarcoding to query taxonomic diversity in eukaryotic organisms, and in the case of ITS, they specifically target Fungi. The presence of a specific species is inferred when diagnostic 18S or ITS gene region sequences are found in the SRA data. We searched for C. neoformans in all available 18S and ITS datasets and gathered all associated journal articles that either cite the SRA data accessions or are cited in the SRA data accessions.
Published metagenetic data often have associated metadata, including latitude and longitude, temperature, and other physical characteristics describing the conditions in which the metagenetic sample was collected. These metadata are not always presented in consistent formats, so study methods may need to be harmonized before metagenetic data can be compared appropriately, as meta-analyses commonly require. We present an analysis that takes as input the articles associated with SRA datasets found to contain evidence of C. neoformans. We apply NLP methods to this corpus of articles to describe the niche of C. neoformans. Our results reinforce the current understanding of the niche of C. neoformans, indicating that an NLP analysis is a pertinent way to identify the niche of an organism. This approach could further the description of virtually any other organism that routinely appears in metagenetic surveys, especially pathogens, whose ecological niches are unknown or poorly understood.
Optional Striking Image
Cryptococcus neoformans cells budding. Image provided courtesy of Felipe H. Santiago-Tirado, colored by Kristina Davis, CC-BY 4.0.
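Illustrative sketch
The abstract states that the number of topics was determined from the model coherence score, but it does not say which coherence measure was used. The following is a minimal sketch of that selection step, assuming a UMass-style document co-occurrence coherence. The helper names (`umass_coherence`, `mean_coherence`, `pick_num_topics`) and the toy corpus are hypothetical and are not taken from the study.

```python
import math


def umass_coherence(top_words, docs):
    """UMass-style coherence for one topic's ranked top words.

    `docs` is a list of token sets. Each word pair contributes
    log((co-document count + 1) / document count of the higher-ranked word).
    Higher scores indicate words that co-occur more often.
    """
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            later, earlier = top_words[i], top_words[j]
            d_earlier = sum(1 for d in docs if earlier in d)
            if d_earlier == 0:
                continue  # word absent from the corpus: skip to avoid dividing by zero
            d_both = sum(1 for d in docs if earlier in d and later in d)
            score += math.log((d_both + 1) / d_earlier)
    return score


def mean_coherence(topics, docs):
    """Average coherence over all topics of one candidate model."""
    return sum(umass_coherence(t, docs) for t in topics) / len(topics)


def pick_num_topics(candidates, docs):
    """Return the topic count whose topics have the highest mean coherence.

    `candidates` maps a topic count k to the k ranked top-word lists that a
    topic model (assumed here to be an LDA fit) produced for that k.
    """
    return max(candidates, key=lambda k: mean_coherence(candidates[k], docs))


# Toy corpus (assumption): each article reduced to a set of nouns.
docs = [
    {"soil", "wood", "decay", "fungi"},
    {"soil", "wood", "fungi"},
    {"marine", "plankton", "water"},
    {"marine", "water", "plankton", "salinity"},
]

# Hypothetical top words from models fitted with 1 and 2 topics.
candidates = {
    1: [["soil", "water"]],
    2: [["soil", "wood"], ["marine", "water"]],
}

best_k = pick_num_topics(candidates, docs)
print(best_k)
```

In a real pipeline the candidate top-word lists would come from topic models fitted at each candidate number of topics, and the articles would then be assigned to the selected topics by a separate classifier such as the Random Forest mentioned in the abstract.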
Publisher
Cold Spring Harbor Laboratory