Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance

Author:

Timme Ruth E. 1; Rand Hugh 1; Shumway Martin 2; Trees Eija K. 3; Simmons Mustafa 4; Agarwala Richa 2; Davis Steven 1; Tillman Glenn E. 4; Defibaugh-Chavez Stephanie 5; Carleton Heather A. 3; Klimke William A. 2; Katz Lee S. 3,6

Affiliation:

1. Center for Food Safety and Applied Nutrition, US Food and Drug Administration, College Park, MD, United States of America

2. National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, United States of America

3. Enteric Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, United States of America

4. Food Safety and Inspection Service, US Department of Agriculture, Athens, GA, United States of America

5. Food Safety and Inspection Service, US Department of Agriculture, Washington, D.C., United States of America

6. Center for Food Safety, College of Agricultural and Environmental Sciences, University of Georgia, Griffin, GA, United States of America

Abstract

Background. As next-generation sequencing technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationships in public health. Most new programs skip the traditional steps of ortholog determination and multi-gene alignment. Instead, they identify variants across a set of genomes and summarize the results in a matrix of single-nucleotide polymorphisms (SNPs) or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so that they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines.

Methods. We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, whole-genome sequencing (WGS) data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and “known” phylogenetic trees in publicly accessible databases and developed a standard spreadsheet format for describing each dataset. To make these benchmarks easy to download, we developed an automated script that reads this spreadsheet format.

Results. Our “outbreak” benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni), together with one simulated dataset in which the “known tree” can accurately be called the “true tree”. The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets.

Discussion. These five benchmark datasets will help standardize the comparison of current and future phylogenomic pipelines and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools. We welcome additional benchmark datasets in our recommended format and, if relevant, will add them to our GitHub site. Together, these datasets, the dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines.
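
As a rough illustration of how a download script could consume a spreadsheet-style dataset description, the Python sketch below parses a tab-delimited table and lists the run accessions that would need to be fetched. The layout (key/value rows, a blank line, then one row per sample), the column names `sample` and `sra_run`, and the accession values are all hypothetical placeholders chosen for this example; they are not taken from the authors' format or script, which should be consulted in the GitHub repository.

```python
import csv
import io


def parse_dataset_table(text):
    """Split a tab-delimited dataset description into (metadata, samples).

    Assumed layout (hypothetical, for illustration only): key/value rows,
    one blank line, then a header row followed by one row per sample.
    """
    header_part, _, table_part = text.strip().partition("\n\n")
    metadata = {}
    for row in csv.reader(io.StringIO(header_part), delimiter="\t"):
        if len(row) >= 2:
            metadata[row[0]] = row[1]
    samples = list(csv.DictReader(io.StringIO(table_part), delimiter="\t"))
    return metadata, samples


def accessions_to_fetch(samples, column="sra_run"):
    """Return the non-empty accessions in `column`, preserving row order."""
    return [s[column] for s in samples if s.get(column)]


# Made-up example table; none of these values are real accessions.
EXAMPLE = (
    "organism\tExample organism\n"
    "tree\texample.newick\n"
    "\n"
    "sample\tsra_run\n"
    "isolate_A\tRUN0001\n"
    "isolate_B\t\n"
    "isolate_C\tRUN0003\n"
)

metadata, samples = parse_dataset_table(EXAMPLE)
to_fetch = accessions_to_fetch(samples)
print(metadata["organism"], to_fetch)
```

Keeping the parsing step separate from any network access means the table can be checked (for example, for samples with no accession) before anything is downloaded.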

Funder

Center for Food Safety and Applied Nutrition at the Food and Drug Administration

Advanced Molecular Detection (AMD) Initiative at Centers for Disease Control and Prevention

Intramural Research Program of the National Institutes of Health, National Library of Medicine

USDA-FSIS program

Publisher

PeerJ

Subject

General Agricultural and Biological Sciences; General Biochemistry, Genetics and Molecular Biology; General Medicine; General Neuroscience

