PHA4GE quality control contextual data tags: standardized annotations for sharing public health sequence datasets with known quality issues to facilitate testing and training

Author:

Griffiths Emma J.1, Mendes Inês2, Maguire Finlay3, Guthrie Jennifer L.4, Wee Bryan A.5, Schmedes Sarah6, Holt Kathryn7, Yadav Chanchal7, Cameron Rhiannon1, Barclay Charlotte1, Dooley Damion1, MacCannell Duncan6, Chindelevitch Leonid8,9, Karsch-Mizrachi Ilene10, Waheed Zahra11, Katz Lee12, Petit III Robert13, Dave Mugdha14, Oluniyi Paul15, Nasar Muhammad Ibtisam16, Raphenya Amogelang17, Hsiao William W. L.1, Timme Ruth E.18

Affiliation:

1. Centre for Infectious Disease Genomics and One Health, Faculty of Health Sciences, Simon Fraser University, Burnaby, British Columbia, Canada

2. Theiagen Genomics, LLC, Highlands Ranch, Colorado, USA

3. Department of Community Health & Epidemiology, Faculty of Medicine, Dalhousie University, Halifax, Nova Scotia, Canada, and Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada

4. Department of Microbiology & Immunology, Western University, London, Ontario, Canada

5. The Roslin Institute, University of Edinburgh, Edinburgh, UK

6. National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Georgia, USA

7. National Microbiology Laboratory, Public Health Agency of Canada, Winnipeg, MB, Canada

8. MRC Centre for Global Infectious Disease Analysis, School of Public Health, Imperial College London, London, UK

9. Department of Infection Biology, London School of Hygiene and Tropical Medicine, London, UK

10. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA

11. European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK

12. Center for Food Safety, University of Georgia, Georgia, USA

13. Wyoming Public Health Laboratory, Wyoming, USA

14. McMaster University, Hamilton, Ontario, Canada

15. Chan Zuckerberg Biohub, San Francisco, CA, USA

16. Department of Biology, College of Science, United Arab Emirates University, Al Ain, Abu Dhabi, UAE

17. Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Ontario, Canada

18. Center for Food Safety and Applied Nutrition, U.S. Food and Drug Administration, College Park, Maryland, USA

Abstract

As public health laboratories expand their genomic sequencing and bioinformatics capacity for the surveillance of different pathogens, labs must carry out robust validation, training, and optimization of wet- and dry-lab procedures. Achieving these goals for algorithms, pipelines and instruments often requires that lower quality datasets be made available for analysis and comparison alongside those of higher quality. This range of data quality in reference sets can complicate the sharing of sub-optimal datasets that are vital for the community and for the reproducibility of assays. Sharing useful but sub-optimal datasets requires careful annotation and documentation of known issues so that the data can be interpreted appropriately, are not mistaken for better quality information, and (together with their derivatives) are easily identifiable in repositories. Unfortunately, there are currently no standardized attributes or mechanisms for tagging poor-quality datasets, or datasets generated for a specific purpose, to maximize their utility, searchability, accessibility and reuse. The Public Health Alliance for Genomic Epidemiology (PHA4GE) is an international community of scientists from public health, industry and academia focused on improving the reproducibility, interoperability, portability, and openness of public health bioinformatic software, skills, tools and data. To address the challenges of sharing lower quality datasets, PHA4GE has developed a set of standardized contextual data tags, namely fields and terms, that can be included in public repository submissions as a means of flagging pathogen sequence data with known quality issues, increasing their discoverability.
The contextual data tags were developed through consultations with the community, including input from the International Nucleotide Sequence Database Collaboration (INSDC), and have been standardized using ontologies, which are community-based resources for defining the tag properties and the relationships between them. The standardized tags are agnostic to the organism and the sequencing technique used and thus can be applied to data generated from any pathogen using an array of sequencing techniques. The tags can also be applied to synthetic (lab created) data. The list of standardized tags is maintained by PHA4GE and can be found at https://github.com/pha4ge/contextual_data_QC_tags. Definitions, ontology IDs, examples of use, as well as a JSON representation, are provided. The PHA4GE QC tags were tested, and are now implemented, by the FDA’s GenomeTrakr laboratory network as part of its routine submission process for SARS-CoV-2 wastewater surveillance. We hope that these simple, standardized tags will help improve communication regarding quality control in public repositories, in addition to making datasets of variable quality more easily identifiable. Suggestions for additional tags can be submitted to PHA4GE via the New Term Request Form in the GitHub repository. By providing a mechanism for feedback and suggestions, we also expect that the tags will evolve with the needs of the community.
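
As a rough illustration of how fields and terms of this kind could make lower quality datasets easier to find, the sketch below attaches QC-style tags to two submission records and filters on them. The field names, term values, and sample identifiers are hypothetical placeholders chosen for this example; they are not taken from the PHA4GE specification, and the maintained list in the repository above should be consulted for the actual tags.

```python
import json

# Hypothetical records: field names and values are illustrative assumptions,
# not the official PHA4GE tags.
records = [
    {
        "sample_id": "sample_A",
        "quality_control_determination": "sequence passed quality control",
    },
    {
        "sample_id": "sample_B",
        "quality_control_determination": "sequence failed quality control",
        "quality_control_issues": ["low average genome coverage"],
    },
]


def flagged(records, field="quality_control_issues"):
    """Return the records that carry at least one value in the given QC field."""
    return [r for r in records if r.get(field)]


# Print only the records that were tagged with a known quality issue.
print(json.dumps(flagged(records), indent=2))
```

Because the tags are plain key-value pairs, the same filter would work on any JSON export that uses them, regardless of organism or sequencing technique.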

Funder

Public Health Agency of Canada

MRC Centre for Global Infectious Disease Analysis

National Center for Biotechnology Information of the National Library of Medicine (NLM), National Institutes of Health

Biotechnology and Biological Sciences Research Council

Publisher

Microbiology Society
