Author:
Liu Hongqi, Shi Shaolong, Ma Tongle
Abstract
When the Vision Transformer (ViT) is applied to image classification, it achieves good accuracy but is limited in how many layers can usefully be stacked: the similarity between different tokens increases as the model deepens, so shallow ViT models suffer from low accuracy while deep ViT models waste computational resources. To address this issue, a multi-valued self-attention mechanism is proposed. By introducing an additional value cue "V" that does not vary arbitrarily, the dot-product similarity attends more closely to each position in the token sequence, which reduces the required depth of the attention network and avoids the vanishing of similarity between tokens in deep layers. The loss function is also improved by combining Focal Loss with Label Smoothing, which strengthens the model's handling of uncertain images while retaining the class-level focus of Focal Loss. Experiments on a rock lithology classification dataset show that, compared with the VGG, ResNet-18, and ResNet-50 models, the proposed model improves accuracy by 17.6%, 12%, and 4.9%, respectively. The experiments also demonstrate the effectiveness of the multi-valued self-attention mechanism, which significantly reduces depth and parameter count while maintaining comparable accuracy. Finally, generalization experiments on the CIFAR-10 dataset further validate these conclusions.
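The combined loss described above can be sketched as follows. This is a minimal illustration of one common way to merge Focal Loss with Label Smoothing (computing the focal-modulated cross-entropy against smoothed targets); the exact formulation in the paper may differ, and the `gamma` and `eps` defaults here are illustrative, not values taken from the paper.

```python
import numpy as np

def focal_loss_with_label_smoothing(logits, target, gamma=2.0, eps=0.1):
    """Focal loss evaluated against label-smoothed targets.

    logits : 1-D array of unnormalized class scores for one sample
    target : integer index of the ground-truth class
    gamma  : focal exponent; 0 recovers plain (smoothed) cross-entropy
    eps    : label-smoothing mass spread uniformly over all classes
    """
    logits = np.asarray(logits, dtype=float)
    k = logits.shape[-1]
    # Softmax probabilities, numerically stabilized by subtracting the max.
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    # Label smoothing: the true class keeps 1 - eps probability mass,
    # the remaining eps is shared uniformly across all k classes.
    y = np.full(k, eps / k)
    y[target] += 1.0 - eps
    # Focal modulation (1 - p)^gamma down-weights easy, confident classes,
    # keeping the category focus of Focal Loss on hard examples.
    return float(-(y * (1.0 - p) ** gamma * np.log(p)).sum())
```

With `gamma=0` and `eps=0` the function reduces to standard cross-entropy, which makes the two added components easy to ablate independently.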