PubChem synonym filtering process using crowdsourcing

Author:

Kim Sunghwan,Yu Bo,Li Qingliang,Bolton Evan E.

Abstract

AbstractPubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource containing more than 100 million unique chemical structures. One of the most requested tasks in PubChem and other chemical databases is to search chemicals by name (also commonly called a “chemical synonym”). PubChem performs this task by looking up chemical synonym-structure associations provided by individual depositors to PubChem. In addition, these synonyms are used for many purposes, including creating links between chemicals and PubMed articles (using Medical Subject Headings (MeSH) terms). However, these depositor-provided name-structure associations are subject to substantial discrepancies within and between depositors, making it difficult to unambiguously map a chemical name to a specific chemical structure. The present paper describes PubChem’s crowdsourcing-based synonym filtering strategy, which resolves inter- and intra-depositor discrepancies in synonym-structure associations as well as in the chemical-MeSH associations. The PubChem synonym filtering process was developed based on the analysis of four crowd-voting strategies, which differ in the consistency threshold value employed (60% vs 70%) and how to resolve intra-depositor discrepancies (a single vote vs. multiple votes per depositor) prior to inter-depositor crowd-voting. The agreement of voting was determined at six levels of chemical equivalency, which considers varying isotopic composition, stereochemistry, and connectivity of chemical structures and their primary components. While all four strategies showed comparable results, Strategy I (one vote per depositor with a 60% consistency threshold) resulted in the most synonyms assigned to a single chemical structure as well as the most synonym-structure associations disambiguated at the six chemical equivalency contexts. Based on the results of this study, Strategy I was implemented in PubChem’s filtering process that cleans up synonym-structure associations as well as chemical-MeSH associations. This consistency-based filtering process is designed to look for a consensus in name-structure associations but cannot attest to their correctness. As a result, it can fail to recognize correct name-structure associations (or incorrect ones), for example, when a synonym is provided by only one depositor or when many contributors are incorrect. However, this filtering process is an important starting point for quality control in name-structure associations in large chemical databases like PubChem.

Funder

U.S. National Library of Medicine

National Library of Medicine

Publisher

Springer Science and Business Media LLC

Reference52 articles.

1. Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, Li Q, Shoemaker BA, Thiessen PA, Yu B, Zaslavsky L, Zhang J, Bolton EE (2023) PubChem 2023 update. Nucleic Acids Res 51(D1):D1373–D1380. https://doi.org/10.1093/nar/gkac956

2. Kim S (2016) Getting the most out of PubChem for virtual screening. Expert Opin Drug Discov 11(9):843–855. https://doi.org/10.1080/17460441.2016.1216967

3. Sayers Eric W, Beck J, Bolton Evan E, Brister JR, Chan J, Comeau Donald C, Connor R, DiCuccio M, Farrell Catherine M, Feldgarden M, Fine Anna M, Funk K, Hatcher E, Hoeppner M, Kane M, Kannan S, Katz Kenneth S, Kelly C, Klimke W, Kim S, Kimchi A, Landrum M, Lathrop S, Lu Z, Malheiro A, Marchler-Bauer A, Murphy Terence D, Phan L, Prasad Arjun B, Pujar S, Sawyer A, Schmieder E, Schneider Valerie A, Schoch Conrad L, Sharma S, Thibaud-Nissen F, Trawick Barton W, Venkatapathi T, Wang J, Pruitt Kim D, Sherry Stephen T (2024) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 52(D1):D33–D43. https://doi.org/10.1093/nar/gkad1044

4. Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, Han L, He J, He S, Shoemaker BA, Wang J, Yu B, Zhang J, Bryant SH (2016) PubChem substance and compound databases. Nucleic Acids Res 44(D1):D1202–D1213. https://doi.org/10.1093/nar/gkv951

5. Wang YL, Bryant SH, Cheng TJ, Wang JY, Gindulyte A, Shoemaker BA, Thiessen PA, He SQ, Zhang J (2017) PubChem BioAssay: 2017 update. Nucleic Acids Res 45(D1):D955–D963. https://doi.org/10.1093/nar/gkw1118

Cited by 1 articles. 订阅此论文施引文献 订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献

1. Glycoscience data content in the NCBI Glycans and PubChem;Analytical and Bioanalytical Chemistry;2024-08-12

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3