Accelerating Big Data Analysis through LASSO-Random Forest Algorithm in QSAR Studies

Published:2021-10-02 Issue:2 Volume:38 Page:469-475
ISSN:1367-4803
Container-title:Bioinformatics
language:en
Author:

Motamedi Fahimeh¹, Pérez-Sánchez Horacio², Mehridehnavi Alireza¹, Fassihi Afshin³, Ghasemi Fahimeh¹,⁴

Affiliation:

1. Department of Bioinformatics and Systems Biology, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan 8174673461, Iran

2. Structural Bioinformatics and High Performance Computing Research Group (BIO-HPC), Computer Engineering Department, UCAM Universidad Católica de Murcia, Murcia E30107, Spain

3. Department of Medicinal Chemistry, School of Pharmacology and Pharmaceutical Sciences, Isfahan University of Medical Sciences, Isfahan 8174673461, Iran

4. Biosensor Research Centre, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan 8174673461, Iran

Abstract

Motivation: The aim of quantitative structure–activity relationship (QSAR) studies is to identify novel drug-like molecules that can be suggested as lead compounds. This article addresses two steps in that process: first, identifying appropriate molecular descriptors with a feature-selection algorithm; and second, predicting the biological activities of designed compounds. Recent studies have shown increased interest in predicting the activities of very large numbers of molecules, known as Big Data, using deep learning models. However, despite these efforts to solve critical challenges in QSAR modeling, over-fitting and massive processing requirements remain major shortcomings of deep learning models. Hence, finding the most effective molecular descriptors in the shortest possible time is an ongoing task. One successful method for speeding up the extraction of the best features from big datasets is the least absolute shrinkage and selection operator (LASSO). This algorithm is a regression model that selects a subset of molecular descriptors, enhancing prediction accuracy and interpretability by removing inappropriate and irrelevant features.

Results: To implement and test the proposed model, a random forest was built to predict the molecular activities of Kaggle competition compounds. The prediction results and computation time of the suggested model were then compared with those of other well-known algorithms, i.e. Boruta-random forest, deep random forest and deep belief network models. The results revealed that improving output correlation through LASSO-random forest leads to appreciably reduced implementation time and model complexity, while maintaining the accuracy of the predictions.

Supplementary information: Supplementary data are available at Bioinformatics online.
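The descriptor-selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the synthetic data, the penalty `alpha=0.2`, and the helper names (`soft_threshold`, `lasso_coordinate_descent`, `select_descriptors`) are all assumptions made for illustration. It solves the standard LASSO objective by cyclic coordinate descent in plain Python and shows how the L1 penalty drives the coefficients of irrelevant descriptors to exactly zero. The random forest stage described in the abstract is omitted here; in practice the surviving columns would be passed to a random forest regressor from an established library.

```python
import random


def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values in [-alpha, alpha] become exactly 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant-zero column carries no information
            # Correlation of descriptor j with the residual, with its own
            # current contribution added back in.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha) / z
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                w[j] = new_w
    return w


def select_descriptors(weights, tol=1e-8):
    """Return the indices of descriptors whose LASSO coefficient is non-zero."""
    return [j for j, wj in enumerate(weights) if abs(wj) > tol]


# Hypothetical data: 10 "descriptors", of which only columns 0 and 3
# actually influence the simulated activity.
rng = random.Random(0)
n_samples, n_features = 200, 10
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [3.0 * row[0] - 2.0 * row[3] + rng.gauss(0.0, 0.1) for row in X]

weights = lasso_coordinate_descent(X, y, alpha=0.2)
selected = select_descriptors(weights)
print("selected descriptor indices:", selected)

# Reduced matrix that a downstream model (e.g. a random forest) would be trained on.
X_reduced = [[row[j] for j in selected] for row in X]
```

A larger `alpha` removes more descriptors at the cost of more shrinkage on the ones that remain, so the penalty would normally be chosen by cross-validation rather than fixed by hand as it is here.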

Funder

Isfahan University of Medical Sciences

Spanish Ministry of Economy and Competitiveness

Fundación Séneca del Centro de Coordinación de la Investigación de la Región de Murcia under Project

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics,Computational Theory and Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Statistics and Probability

Link

https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btab659/41149078/btab659.pdf

Reference: 32 articles.

1. High-dimensional QSAR prediction of anticancer potency of imidazo[4,5-b]pyridine derivatives using adjusted adaptive LASSO;Algamal;J. Chemom,2015

2. Streaming feature selection algorithms for big data: a survey;AlNuaimi;Appl. Comput. Inf.,2019

3. Protein kinase inhibitors’ classification using K-nearest neighbor algorithm;Arian;Comput. Biol. Chem,2020

4. QSAR modeling: where have you been? Where are you going to?;Cherkasov;J. Med. Chem,2014

Cited by 24 articles.

1. Development and validation of AI-assisted transcriptomic signatures to personalize adjuvant chemotherapy in patients with pancreatic ductal adenocarcinoma;Annals of Oncology;2024-09

2. Construction of a risk prediction model for lung infection after chemotherapy in lung cancer patients based on the machine learning algorithm;Frontiers in Oncology;2024-08-09

3. Unraveling pathogenesis, biomarkers and potential therapeutic agents for endometriosis associated with disulfidptosis based on bioinformatics analysis, machine learning and experiment validation;Journal of Biological Engineering;2024-07-26

4. Identification of potential vascular endothelial growth factor receptor inhibitors via tree‐based learning modeling and molecular docking simulation;Journal of Chemometrics;2024-04

5. Explainable machine learning in outcome prediction of high-grade aneurysmal subarachnoid hemorrhage;Aging;2024-03-01