Seqenv: linking sequences to environments through text mining

Authors:

Sinclair Lucas (1), Ijaz Umer Z. (2), Jensen Lars Juhl (3), Coolen Marco J.L. (4), Gubry-Rangin Cecile (5), Chroňáková Alica (6), Oulas Anastasis (7, 8), Pavloudi Christina (8), Schnetzer Julia (9), Weimann Aaron (10), Ijaz Ali (11), Eiler Alexander (1), Quince Christopher (12), Pafilis Evangelos (8)

Affiliations:

1. Department of Ecology and Genetics, Limnology, Uppsala University, Uppsala, Sweden

2. Infrastructure and Environment Research Division, School of Engineering, University of Glasgow, Glasgow, United Kingdom

3. The Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark

4. Western Australia Organic and Isotope Geochemistry Centre (WA-OIGC), Department of Chemistry, Curtin University of Technology, Bentley, WA, Australia

5. Institute of Biological & Environmental Sciences, University of Aberdeen, Aberdeen, United Kingdom

6. Institute of Soil Biology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic

7. Bioinformatics Group, The Cyprus Institute of Neurology and Genetics, Nicosia, Cyprus

8. Institute of Marine Biology Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklion Crete, Greece

9. Department of Molecular Ecology, Microbial Genomics and Bioinformatics Group, Max Planck Institute for Marine Microbiology, Bremen, Germany

10. Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany

11. Hawkesbury Institute for the Environment, University of Western Sydney, Hawkesbury, Sydney, Australia

12. Warwick Medical School, University of Warwick, Warwick, United Kingdom

Abstract

Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies often focus on specific environment types or processes, which leads to the production of individual, unconnected datasets. The large amounts of legacy sequence data with associated metadata that already exist can be harnessed to place the genetic information found in these surveys into a wider environmental context. Here we introduce a software program, seqenv, to carry out precisely such a task. It automatically performs similarity searches of short sequences against the “nt” nucleotide database provided by NCBI and, for every hit, extracts the “isolation source” textual metadata field when it is available. After collecting all the isolation sources from all the search results, we run a text mining algorithm to identify and parse words that are associated with the Environment Ontology (EnvO) controlled vocabulary. This, in turn, enables us to determine in which environments individual sequences or taxa have previously been observed and, by weighted summation of those results, to summarize complete samples. We present two demonstrative applications of seqenv: a survey of ammonia-oxidizing archaea and a plankton paleome dataset from the Black Sea. These demonstrate the ability of the tool to reveal novel patterns in HTS data and its utility in the fields of environmental source tracking, paleontology, and studies of microbial biogeography. To install seqenv, go to: https://github.com/xapple/seqenv.
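
The last two steps described in the abstract (tagging isolation sources with environment terms, then summarizing a sample by weighted summation) can be illustrated with a minimal Python sketch. This is not the seqenv implementation: the small term dictionary, the naive substring matching used in place of a real text mining tagger, the per-sequence normalization, and all function names are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical mini-vocabulary (phrase -> environment term); a real run
# would rely on the full EnvO vocabulary and a proper text mining tagger.
ENV_TERMS = {
    "soil": "soil",
    "sediment": "sediment",
    "sea water": "sea water",
    "seawater": "sea water",
    "lake": "lake",
}


def tag_environments(isolation_source):
    """Return the set of environment terms whose phrase occurs in the text.

    Naive substring matching, used here only as a stand-in for text mining.
    """
    text = isolation_source.lower()
    return {term for phrase, term in ENV_TERMS.items() if phrase in text}


def sequence_profile(isolation_sources):
    """Normalized term frequencies over all hits of one query sequence."""
    counts = Counter()
    for source in isolation_sources:
        counts.update(tag_environments(source))
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def sample_profile(abundances, profiles):
    """Abundance-weighted sum of the per-sequence profiles of one sample."""
    summary = defaultdict(float)
    for seq_id, abundance in abundances.items():
        for term, freq in profiles.get(seq_id, {}).items():
            summary[term] += abundance * freq
    return dict(summary)


# Made-up isolation sources, as they might be collected from search hits.
hits = {
    "seq1": ["forest soil", "agricultural soil", "lake sediment"],
    "seq2": ["Black Sea seawater", "marine sediment"],
}
profiles = {seq_id: sequence_profile(src) for seq_id, src in hits.items()}

# Made-up read counts of each sequence in one sample.
sample = sample_profile({"seq1": 30, "seq2": 10}, profiles)
print(sample)
```

In this sketch each sequence contributes to the sample summary in proportion to its abundance, so an abundant sequence seen mostly in soil pulls the whole sample towards "soil". How seqenv itself weights and normalizes these values may differ.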

Publisher

PeerJ

Subject

General Agricultural and Biological Sciences; General Biochemistry, Genetics and Molecular Biology; General Medicine; General Neuroscience

