Machine learning for identification of silylated derivatives from mass spectra

Published:2022-09-15 Issue:1 Volume:14 Page:
ISSN:1758-2946
Container-title:Journal of Cheminformatics
language:en
Short-container-title:J Cheminform

Author:

Ljoncheva Milka, Stepišnik Tomaž, Kosjek Tina, Džeroski Sašo

Abstract

Motivation: Compound structure identification increasingly relies on sophisticated computational tools, among which machine learning tools are a recent addition that is quickly gaining importance. These tools, of which the method titled Compound Structure Identification:Input Output Kernel Regression (CSI:IOKR) is an excellent example, have been used to elucidate compound structure from mass spectral (MS) data with significant accuracy, confidence and speed. They have, however, largely focused on data coming from liquid chromatography coupled to tandem mass spectrometry (LC–MS). Gas chromatography coupled to mass spectrometry (GC–MS) is an alternative that offers several advantages over LC–MS, including higher data reproducibility. Of special importance is the substantial compound coverage offered by GC–MS, further expanded by derivatization procedures, such as silylation, which can improve the volatility, thermal stability and chromatographic peak shape of semi-volatile analytes. Despite these advantages and the increasing size of compound databases and MS libraries, GC–MS data have not yet been used by machine learning approaches to compound structure identification.

Results: This study presents a successful application of the CSI:IOKR machine learning method for the identification of environmental contaminants from GC–MS spectra. We use CSI:IOKR as an alternative to exhaustive search of MS libraries, independent of instrumental platform and data processing software. We use a comprehensive dataset of GC–MS spectra of trimethylsilyl derivatives and their molecular structures, derived from a large commercially available MS library, to train a model that maps between spectra and molecular structures. We test the learned model on a different dataset of GC–MS spectra of trimethylsilyl derivatives of environmental contaminants, generated in-house and made publicly available. The results show that 37% (resp. 50%) of the tested compounds are correctly ranked among the top 10 (resp. 20) candidate compounds suggested by the model. Even though spectral comparisons with reference standards or de novo structural elucidations are necessary to validate the predictions, machine learning provides efficient candidate prioritization and reduces the time spent on compound annotation.

Publisher

Springer Science and Business Media LLC

Subject

Library and Information Sciences,Computer Graphics and Computer-Aided Design,Physical and Theoretical Chemistry,Computer Science Applications

Link

https://link.springer.com/content/pdf/10.1186/s13321-022-00636-1.pdf

Reference 76 articles.

1. Lippmann M (2013) Exposure science in the 21st century: a vision and a strategy. J Expo Sci Environ Epidemiol 23(1):1–1

2. Wild CP (2005) Complementing the genome with an “exposome”: the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol Biomarkers Prev 14(8):1847–50. https://doi.org/10.1158/1055-9965.EPI-05-0456

3. Vermeulen R, Schymanski EL, Barabási AL, Miller GW (2020) The exposome and health: where chemistry meets biology. Science 367(6476):392–6. https://doi.org/10.1126/science.aay3164

4. Council NR (2012) Exposure science in the 21st century: a vision and a strategy. The National Academies Press, Washington

5. Schymanski EL, Kondić T, Neumann S, Thiessen PA, Zhang J, Bolton EE (2021) Empowering large chemical knowledge bases for exposomics: PubChemLite meets MetFrag. J Cheminformatics. https://doi.org/10.1186/s13321-021-00489-0

Cited by 6 articles.

1. Comprehensive steroid screening in bovine and porcine urine by GC-HRMS;Microchemical Journal;2024-09

2. Machine learning methods for compound annotation in non‐targeted mass spectrometry—A brief overview of fingerprinting, in silico fragmentation and de novo methods;Rapid Communications in Mass Spectrometry;2024-08-24

3. Beyond target chemicals: updating the NORMAN prioritisation scheme to support the EU chemicals strategy with semi-quantitative suspect/non-target screening data;Environmental Sciences Europe;2024-06-12

4. Evaluation of normalization strategies for GC-based metabolomics;Metabolomics;2024-02-12

5. Reactivity-based identification of oxygen containing functional groups of chemicals applied as potential classifier in non-target analysis;Scientific Reports;2023-12-20