Abstract
Vision Transformer (ViT) shows potential in bearing fault
diagnosis because its multi-head self-attention mechanism and
parallel feature extraction network efficiently produce a robust and
complete feature representation of the fault. However,
its adaptation to noise interference relies on a sufficiently large
number of training samples for learning the local features of the
fault, and its performance may degrade when only a limited
number of samples is available for model training. To combat
this challenge, an improved ViT diagnosis model based on local
feature expansion, LFE-ViT, is proposed. An auxiliary feature
extraction block, built on a local feature expansion
network, is introduced and works in parallel with the ViT encoder. By
enlarging the receptive field, the block extracts multi-scale local
features in a high-dimensional space from the limited
samples. Then, through a feature embedding channel, the extracted
local features are transmitted to the ViT encoder. Finally, by
virtue of the multi-head self-attention mechanism, which captures the
global information of the time sequence, a fault diagnosis model that
combines comprehensive local and global feature information is
obtained. Experimental validation on the bearing fault dataset from
Case Western Reserve University shows that LFE-ViT achieves
satisfactory diagnosis performance with limited samples and
in noisy environments.
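
The data flow described above (multi-scale local features from an enlarged receptive field, embedded as tokens, then mixed globally by multi-head self-attention) can be illustrated with a minimal NumPy sketch. It is not the LFE-ViT implementation: the abstract does not specify how the receptive field is enlarged, so dilated 1-D convolution is assumed here, and the smoothing kernel, the dilation rates (1, 2, 4), the patch length of 64, the four heads, and the random untrained projection weights are all hypothetical choices made only to show the shapes involved.

```python
import numpy as np


def dilated_conv1d(x, kernel, dilation):
    """Same-length 1-D dilated cross-correlation (odd kernel length assumed)."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)  # zero padding on both sides
    out = np.zeros(len(x))
    for i, w in enumerate(kernel):
        start = i * dilation
        out += w * xp[start:start + len(x)]
    return out


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, n_heads, rng):
    """Scaled dot-product self-attention with random (untrained) projections."""
    n, d = tokens.shape
    assert d % n_heads == 0, "embedding size must be divisible by n_heads"
    dh = d // n_heads
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    def split_heads(m):
        # (n, d) -> (n_heads, n, dh)
        return m.reshape(n, n_heads, dh).transpose(1, 0, 2)

    q, k, v = (split_heads(tokens @ w) for w in (w_q, w_k, w_v))
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (n_heads, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)  # merge heads
    return out, attn


rng = np.random.default_rng(0)
signal = rng.standard_normal(1024)      # stand-in for one vibration segment
kernel = np.array([0.25, 0.5, 0.25])    # hypothetical fixed filter
dilations = (1, 2, 4)                   # hypothetical multi-scale rates

# Local branch: one channel per dilation rate -> (1024, 3)
local = np.stack([dilated_conv1d(signal, kernel, d) for d in dilations], axis=1)

# Feature embedding: group 64 consecutive time steps into one token -> (16, 192)
patch = 64
tokens = local.reshape(-1, patch * local.shape[1])

# Global branch: every token attends to every other token
fused, attn = multi_head_self_attention(tokens, n_heads=4, rng=rng)
print(local.shape, tokens.shape, fused.shape, attn.shape)
```

In a trained model the filters and projections would be learned, and a classification head would follow the encoder; the sketch only shows how larger dilation rates widen the local context per channel while attention relates all patches of the time sequence to one another.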