Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases

Published: 2019-10-05

Author:

Chen Qingyu, Britto Ramona, Erill Ivan, Jeffery Constance J., Liberzon Arthur, Magrane Michele, Onami Jun-ichi, Robinson-Rechavi Marc, Sponarova Jana, Zobel Justin, Verspoor Karin

Abstract

The volume of biological database records is growing rapidly, populated by complex records drawn from heterogeneous sources. A specific challenge is duplication, that is, the presence of redundancy (records with high similarity) or inconsistency (dissimilar records that correspond to the same entity). The characteristics (which records are duplicates), impact (why duplicates are significant), and solutions (how to address duplication) are not well understood. Studies on the topic are neither recent nor comprehensive. In addition, other data quality issues, such as inconsistencies and inaccuracies, are also of concern in the context of biological databases. A primary focus of this paper is to present and consolidate the opinions of over 20 experts and practitioners on the topic of duplication in biological sequence databases. The results reveal that survey participants believe that duplicate records are diverse; that the negative impacts of duplicates are severe, while positive impacts depend on correct identification of duplicates; and that duplicate detection methods need to be more precise, scalable, and robust. A secondary focus is to consider other quality issues. We observe that biocuration is the key mechanism used to ensure the quality of this data, and explore the issues through a case study of curation in UniProtKB/Swiss-Prot as well as an interview with an experienced biocurator. While biocuration is a vital solution for handling of data quality issues, a broader community effort is needed to provide adequate support for thorough biocuration in the face of widespread quality concerns.

Publisher

Cold Spring Harbor Laboratory

References: 109 articles.

1. Searching and Navigating UniProt Databases

2. GenBank

3. European Nucleotide Archive in 2016

4. The international nucleotide sequence database collaboration; Nucleic Acids Res, 2017

5. UniProt: the universal protein knowledgebase

Cited by 2 articles.

1. Data quality-aware genomic data integration; Computer Methods and Programs in Biomedicine Update; 2021

2. Openness and trust in data-intensive science: the case of biocuration; Medicine, Health Care and Philosophy; 2020-06-10