The test results show that fast Fourier transform processing with multiple time superposition and a dimension length of 40 benefits model accuracy the most. The loss of the convolutional recurrent network (CRN) model is much lower than that of the other three models, indicating that the music tone recognition model learns better. The CRN also has the highest accuracy and recall rates: across the four music tone indicators, its accuracy rates are 94.6%, 92.4%, 93.5%, and 92.5%, and its recall rates are 93.2%, 94.9%, 95.2%, and 88.6%, respectively. The improved algorithm also achieves the highest F1 values, which makes it suitable for use in vocal music teaching courses. These results suggest that the algorithm can be applied broadly in music tone recognition and contributes to the development of the field.
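For reference, the F1 value combines the two reported rates as their harmonic mean. The short sketch below is not the authors' evaluation code. It assumes that the reported "accuracy rate" plays the role of precision, which the text does not state explicitly, and the names `f1_score`, `accuracy_rates`, and `recall_rates` are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Percentages reported for the CRN on the four music tone indicators.
# Assumption: the "accuracy rate" is treated as precision.
accuracy_rates = [94.6, 92.4, 93.5, 92.5]
recall_rates = [93.2, 94.9, 95.2, 88.6]

f1_values = [f1_score(p, r) for p, r in zip(accuracy_rates, recall_rates)]

for i, f1 in enumerate(f1_values, start=1):
    print(f"indicator {i}: F1 = {f1:.1f}%")
```

Under this assumption, the fourth indicator has the lowest F1 of the four, because its recall (88.6%) is noticeably lower than the others.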