Abstract
The quality and traceability of microbial genomics data in public databases are deteriorating as these databases rapidly expand and struggle to cope with data curation challenges. Although the availability of public genomic data has become essential for modern life sciences research, the curation of the data is a growing area of concern that has significant real-world impacts on public health epidemiology, drug discovery, and environmental biosurveillance research [1–6]. Public microbial genome databases such as NCBI’s RefSeq database leverage the scalability of crowdsourcing for growth, but they do not require data provenance to the original biological source materials or accurate descriptions of how the data was produced [7]. Here, we describe the de novo assembly of 1,113 bacterial genome references produced from authenticated materials sourced from the American Type Culture Collection (ATCC), each with full data provenance. Over 98% of these ATCC Standard Reference Genomes (ASRGs) are superior to assemblies for comparable strains found in NCBI’s RefSeq database. Comparative genomics analysis revealed significant issues in RefSeq bacterial genome assemblies related to genome completeness, mutations, structural differences, metadata errors, and gaps in traceability to the original biological source materials. For example, nearly half of RefSeq assemblies lack details on sample source information, sequencing technology, or bioinformatics methods. We suggest that the quality of genomic metadata, the traceability of the data, and the methods used to produce them are intrinsically connected to the quality of the resulting genome assemblies themselves. Our results highlight common problems with “reference genomes” and underscore the importance of data provenance for precision science and reproducibility.
These gaps in metadata accuracy and data provenance represent an “elephant in the room” for microbial genomics research, but addressing these issues would require raising the level of accountability for data depositors and our own expectations of data quality.
Publisher
Cold Spring Harbor Laboratory