Abstract
In recent years, accelerated technological advances have driven exponential growth in sequencing projects, significantly increasing the amount of available data and creating new challenges for biological sequence analysis. Consequently, techniques capable of analyzing large amounts of data, such as machine learning (ML) algorithms, have been explored. ML algorithms are being used to analyze and classify biological sequences, despite the intrinsic difficulty of finding sequence representations suitable for them. Extracting numerical features to represent sequences makes it statistically feasible to apply universal concepts from Information Theory, such as Tsallis and Shannon entropy. In this study, we propose a novel Tsallis entropy-based feature extractor that provides useful information for classifying biological sequences. To assess its relevance, we prepared five case studies: (1) an analysis of the entropic index q; (2) performance testing of the best entropic indices on new datasets; (3) a comparison with Shannon entropy; (4) a comparison with generalized entropies; and (5) an investigation of Tsallis entropy in the context of dimensionality reduction. Our proposal proved effective: it outperformed Shannon entropy, was robust in terms of generalization, and was potentially able to capture information in fewer dimensions than methods such as Singular Value Decomposition and Uniform Manifold Approximation and Projection.
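As an illustration of the kind of feature the abstract describes, the sketch below computes the Tsallis entropy S_q = (1 - sum_i p_i^q) / (q - 1) over k-mer frequencies of a sequence, falling back to Shannon entropy as q approaches 1. It is a minimal sketch and not the authors' implementation: the choice of k-mer frequencies as the probability source, the natural logarithm, and the names `kmer_probabilities`, `tsallis_entropy`, `tsallis_features`, and `max_k` are all assumptions made here for illustration.

```python
import math
from collections import Counter


def kmer_probabilities(seq, k):
    """Relative frequencies of the overlapping k-mers in seq (assumed feature source)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def tsallis_entropy(probs, q):
    """Tsallis entropy of a discrete distribution for entropic index q."""
    if not probs:
        # Sequence shorter than k: no k-mers, so report zero entropy.
        return 0.0
    if math.isclose(q, 1.0):
        # In the limit q -> 1, Tsallis entropy reduces to Shannon entropy (in nats).
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def tsallis_features(seq, q, max_k=3):
    """One entropy value per k-mer size from 1 to max_k (hypothetical feature layout)."""
    return [tsallis_entropy(kmer_probabilities(seq, k), q) for k in range(1, max_k + 1)]


if __name__ == "__main__":
    print(tsallis_features("ACGTACGTTTGA", q=2.0))
```

Each sequence is mapped to a fixed-length numeric vector, which is the form a standard ML classifier expects; the entropic index q is the free parameter examined in the first case study.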
Funder
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Universidade de São Paulo
São Paulo Research Foundation
Deutsche Forschungsgemeinschaft
Subject
General Physics and Astronomy
Cited by
3 articles.
1. BioAutoML: Democratizing Machine Learning in Life Sciences;Anais Estendidos do XXIV Simpósio Brasileiro de Computação Aplicada à Saúde (SBCAS 2024);2024-06-25
2. Bioinformatics tools for the sequence complexity estimates;Biophysical Reviews;2023-09-15
3. Non-additive entropies and statistical mechanics at the edge of chaos: a bridge between natural and social sciences;Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences;2023-08-14