Inductive Biases in Feature Reduction for QSAR: SHAP vs. Autoencoders
Published: 2025-05-30
Issue: 1
Volume: 3
Page: 40-49
ISSN: 3025-8618
Container-title: Infolitika Journal of Data Science
Short-container-title: Infolitika J. Data Sci.
Author:
Noviandy Teuku Rizky, Idroes Ghifari Maulana, Lala Andi, Helwani Zuchra, Idroes Rinaldi
Abstract
Machine learning models in drug discovery often depend on high-dimensional molecular descriptors, many of which may be redundant or irrelevant. Reducing these descriptors is essential for improving model performance, interpretability, and computational efficiency. This study compares two widely used reduction strategies, SHAP-based feature selection and autoencoder-based compression, in the context of Quantitative Structure-Activity Relationship (QSAR) classification. LightGBM serves as a consistent modeling framework for evaluating four descriptor sets: all descriptors, the top 50 SHAP-ranked descriptors, the top 100 SHAP-ranked descriptors, and a 64-dimensional autoencoder embedding. The results show that SHAP-based selection produces interpretable and stable models with minimal performance loss, particularly when using the top 100 descriptors. In contrast, the autoencoder achieves the highest test performance by capturing nonlinear patterns in a compact, low-dimensional representation, although this comes at the cost of interpretability and consistency across data splits. These findings reflect the differing inductive biases of the two methods: SHAP prioritizes sparsity and attribution, while autoencoders focus on reconstruction and continuity. The analysis emphasizes that descriptor reduction strategies are not interchangeable. SHAP-based selection is suitable for applications where interpretability and reliability are essential, such as hypothesis-driven or regulatory settings. Autoencoders are more appropriate for performance-driven tasks, including virtual screening. The choice of reduction strategy should be guided not only by performance metrics but also by the specific modeling requirements and assumptions relevant to cheminformatics workflows.
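The "top 50 and 100 SHAP-ranked descriptors" step can be illustrated with a minimal sketch. It assumes that descriptors are ranked by their mean absolute SHAP value across samples, which is a common convention but is not confirmed by the abstract. The function name `top_k_by_mean_abs_shap`, the toy attribution matrix, and the descriptor names are hypothetical; in practice the matrix would come from a SHAP explainer applied to the fitted LightGBM model.

```python
from typing import List, Sequence


def top_k_by_mean_abs_shap(
    shap_values: Sequence[Sequence[float]],
    names: Sequence[str],
    k: int,
) -> List[str]:
    """Return the k descriptor names with the largest mean |SHAP| value.

    shap_values is a samples-by-descriptors matrix of attributions.
    Ranking by mean absolute value is an assumed convention, not
    necessarily the exact procedure used in the study.
    """
    if not shap_values:
        raise ValueError("shap_values must contain at least one sample")
    n_features = len(names)
    if any(len(row) != n_features for row in shap_values):
        raise ValueError("every row must have one value per descriptor")

    n_samples = len(shap_values)
    # Global importance of each descriptor: mean absolute attribution.
    importance = [
        sum(abs(row[j]) for row in shap_values) / n_samples
        for j in range(n_features)
    ]
    # Sort descriptor indices from most to least important.
    order = sorted(range(n_features), key=lambda j: importance[j], reverse=True)
    return [names[j] for j in order[:k]]


# Hypothetical toy data: 3 molecules, 4 descriptors.
toy_shap = [
    [0.40, -0.05, 0.00, 0.10],
    [-0.30, 0.02, 0.01, -0.20],
    [0.50, -0.01, 0.00, 0.15],
]
toy_names = ["logP", "TPSA", "nRing", "MolWt"]

selected = top_k_by_mean_abs_shap(toy_shap, toy_names, k=2)
print(selected)
```

Because the selected columns are original descriptors, the reduced model stays directly interpretable. An autoencoder embedding, by contrast, mixes all descriptors into each latent dimension.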
Publisher
PT. Heca Sentra Analitika