Abstract
Halophilic proteins possess unique structural properties and exhibit high stability under extreme conditions. These distinct characteristics make them invaluable for applications in fields such as bioenergy, pharmaceuticals, environmental clean-up and energy production. Generally, halophilic proteins are discovered and characterized through labor-intensive and time-consuming wet-lab experiments. Here, we introduce HPClas, a machine learning-based classifier developed using the CatBoost ensemble learning technique to identify halophilic proteins. Extensive in silico calculations were conducted on a large public data set of 24,955 samples and an independent test set of 292 sample pairs, on which HPClas achieved AUROCs of 0.915 and 0.860, respectively. The source code and curated data set of HPClas are publicly available at https://github.com/Showmake2/HPClas. In conclusion, HPClas is a promising tool to aid in the identification of halophilic proteins and to accelerate their application in different fields.
Impact Statement
In this study, we used a method based on predicting the proteins secreted by extremely halophilic bacteria to collect a large number of halophilic proteins. Using these data, we trained a halophilic protein classifier that determines whether an input protein is halophilic with an accuracy of 80%. This research could not only promote the exploration and mining of halophilic proteins in nature, but also provide guidance for the generation of mutant halophilic enzymes.
Publisher
Cold Spring Harbor Laboratory