PatCID: an open-access dataset of chemical structures in patent documents

Published:2024-08-02 Issue:1 Volume:15
ISSN:2041-1723
Container-title:Nature Communications
language:en
Short-container-title:Nat Commun

Author:

Morin Lucas, Weber Valéry, Meijer Gerhard Ingmar, Yu Fisher, Staar Peter W. J.

Abstract

The automatic analysis of patent publications has the potential to accelerate research across various domains, including drug discovery and materials science. Within patent documents, crucial information often resides in visual depictions of molecular structures. PatCID (Patent-extracted Chemical-structure Images database for Discovery) makes this information accessible at scale. It enables users to search for which molecules are displayed in which documents. PatCID contains 81M chemical-structure images and 14M unique chemical structures. Here, we compare PatCID with state-of-the-art chemical patent databases. On a random set, PatCID retrieves 56.0% of molecules, which is higher than the automatically created databases Google Patents (41.5%) and SureChEMBL (23.5%), as well as the manually created databases Reaxys (53.5%) and SciFinder (49.5%). By leveraging state-of-the-art document-understanding methods, PatCID provides high-quality data that outperforms currently available automatically generated patent databases. PatCID even competes with proprietary, manually created patent databases. This enables promising applications in automatic literature review and learning-based molecular generation methods. The dataset is freely accessible for download.

Publisher

Springer Science and Business Media LLC

Link

https://www.nature.com/articles/s41467-024-50779-y.pdf

Reference: 64 articles.

1. Ohms, J. Current methodologies for chemical compound searching in patents: a case study. World Patent Inf. 66, 102055 (2021).

2. Bregonje, M. Patents: A unique source for scientific technical information in chemistry related industry? World Patent Inf. 27, 309–315 (2005).

3. Southan, C., Varkonyi, P., Boppana, K., Jagarlapudi, S. A. & Muresan, S. Tracking 20 years of compound-to-target output from literature and patents. PLoS ONE 8, 1–13 (2013).

4. Magariños, M. P. et al. Illuminating the druggable genome through patent bioactivity data. PeerJ 11, e15153 (2023).

5. Lawson, A. J., Swienty-Busch, J., Géoui, T. & Evans, D. The Making of Reaxys—Towards Unobstructed Access to Relevant Chemistry Information Ch. 8, 127–148 (American Chemical Society, 2014).

Cited by 1 article.

1. Revealing Chemical Trends: Insights from Data-Driven Visualization and Patent Analysis in Exposomics Research;Environmental Science & Technology Letters;2024-08-30