Categorical Variable Mapping Considerations in Classification Problems: Protein Application

Published:2023-01-05 Issue:2 Volume:11 Page:279
ISSN:2227-7390
Container-title:Mathematics
language:en
Short-container-title:Mathematics

Author:

Alfonso Perez Gerardo¹, Castillo Raquel¹

Affiliation:

1. Biocomp Group, Institute of Advanced Materials (INAM), Universitat Jaume I, 12071 Castello, Spain

Abstract

Mapping categorical variables to numerical values is common in machine learning classification problems, and it is frequently performed in a relatively arbitrary manner. We present four assumptions, each tested numerically, about such mappings in the context of protein classification using amino acid information. These assumptions concern the mapping of categorical variables in protein classification problems without resorting to approaches such as natural language processing (NLP). The first three assumptions relate to equivalent mappings, and the fourth involves a comparable mapping using a proposed eigenvalue-based matrix representation of the amino acid chain. The assumptions were tested across 23 different machine learning algorithms. The numerical simulations are consistent with the presented assumptions, such as translation and permutation, and the eigenvalue approach generates classifications that are either not statistically different from the base case or have higher mean values, while also offering advantages such as fixed, predetermined dimensions regardless of the size of the analyzed protein. This approach achieved an accuracy of 83.25%. An optimization algorithm is also presented that selects an appropriate number of neurons in an artificial neural network applied to the same protein classification problem, achieving an accuracy of 85.02%. The model includes a quadratic penalty function to reduce the chance of overfitting.
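To make the kinds of mapping discussed in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows an integer encoding of amino acid letters, a translated mapping (every code shifted by a constant), and a permuted mapping (codes reassigned among letters). The abstract does not say how the eigenvalue-based matrix is built, so `eigen_features` is a hypothetical stand-in: it assumes a symmetrized 20×20 count matrix of adjacent residue pairs and uses its eigenvalues as a feature vector whose length does not depend on protein size. All function names, the 20-letter alphabet ordering, and the example sequences are illustrative assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def make_mapping(offset=0, permutation=None):
    """Assign an integer code to each amino acid letter.

    offset shifts every code by a constant (a translation); permutation
    reassigns which letter receives which code. Both choices are arbitrary.
    """
    n = len(AMINO_ACIDS)
    order = np.arange(n) if permutation is None else np.asarray(permutation)
    return {aa: int(order[i]) + offset for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence, mapping):
    """Turn a sequence string into an integer array under a given mapping."""
    return np.array([mapping[aa] for aa in sequence])


def eigen_features(sequence):
    """Hypothetical fixed-length representation (an assumption, not the
    construction used in the paper): eigenvalues of a symmetrized count
    matrix of adjacent residue pairs."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    counts = np.zeros((len(AMINO_ACIDS), len(AMINO_ACIDS)))
    for a, b in zip(sequence, sequence[1:]):
        counts[index[a], index[b]] += 1
    symmetric = counts + counts.T
    # eigvalsh returns the real eigenvalues of a symmetric matrix, ascending
    return np.linalg.eigvalsh(symmetric)


seq = "MKTAYIAKQR"  # made-up example sequence
base = make_mapping()
shifted = make_mapping(offset=5)
permuted = make_mapping(permutation=np.random.default_rng(0).permutation(20))

print(encode(seq, base))
print(encode(seq, shifted) - encode(seq, base))  # every entry is 5
print(encode(seq, permuted))
print(eigen_features(seq).shape, eigen_features("MKT").shape)  # (20,) (20,)
```

In this sketch the translated and permuted encodings carry the same category information as the base encoding, which is the sense in which such mappings can be called equivalent. The eigenvalue vector has 20 entries for both the 10-residue and the 3-residue example, whereas the direct encodings grow with sequence length.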

Funder

Spanish Ministerio de Ciencia, Innovación y Universidades

Universitat Jaume I

Publisher

MDPI AG

Subject

General Mathematics, Engineering (miscellaneous), Computer Science (miscellaneous)

Link

https://www.mdpi.com/2227-7390/11/2/279/pdf
