Author:
Tinh Nguyen Huy, Vinh Le Sy
Abstract
Amino acid substitution models play an important role in studying the evolutionary relationships among species from protein sequences. An amino acid substitution model has a large number of parameters and is therefore estimated from hundreds or thousands of alignments. Both general models and clade-specific models have been estimated and widely used in phylogenetic analyses. The maximum likelihood method is normally used to select the best-fit model for the specific protein alignment under study. A number of studies have discussed theoretical concerns about, as well as the computational burden of, maximum likelihood methods in model selection. Recently, machine learning methods have been proposed for selecting nucleotide models. In this paper, we propose methods to create summary statistics from protein alignments to efficiently train a network, called ModelDetector, based on the convolutional neural network ResNet-18 for detecting amino acid models. Experiments on simulated data showed that the accuracy of ModelDetector was comparable with that of the maximum likelihood method ModelFinder. The ModelDetector network was trained on 64,800 alignments on a computer with 8 cores (without a GPU) in about 12 hours. It is orders of magnitude faster than the maximum likelihood method in inferring amino acid substitution models and can analyze genome alignments with millions of sites in minutes.
Publisher
Cold Spring Harbor Laboratory