Affiliation:
1. School of Informatics, Xiamen University, Xiamen, Fujian, China
2. Department of Stomatology, Zhongshan Hospital (Xiamen), Fudan University, Xiamen, Fujian, China
3. Department of Stomatology, Zhongshan Hospital, Xiamen University, Xiamen, Fujian, China
Abstract
BACKGROUND: Oral cancer is a malignant tumor that usually occurs within the tissues of the mouth. This type of cancer mainly includes tumors in the lining of the mouth, tongue, lips, buccal mucosa and gums. Oral cancer is on the rise globally, especially in some specific risk groups. The early stage of oral cancer is usually asymptomatic, while the late stage may present with ulcers, lumps, bleeding, etc. OBJECTIVE: The objective of this paper is to propose an effective and accurate method for the identification and classification of oral cancer. METHODS: We applied two deep learning methods, CNN and Transformers. First, we propose a new CANet classification model for oral cancer, which uses attention mechanisms combined with neglected location information to explore the complex combination of attention mechanisms and deep networks, and fully tap the potential of attention mechanisms. Secondly, we design a classification model based on Swim transform. The image is segmented into a series of two-dimensional image blocks, which are then processed by multiple layers of conversion blocks. RESULTS: The proposed classification model was trained and predicted on Kaggle Oral Cancer Images Dataset, and satisfactory results were obtained. The average accuracy, sensitivity, specificity and F1-Socre of Swin transformer architecture are 94.95%, 95.37%, 95.52% and 94.66%, respectively. The average accuracy, sensitivity, specificity and F1-Score of CANet model were 97.00%, 97.82%, 97.82% and 96.61%, respectively. CONCLUSIONS: We studied different deep learning algorithms for oral cancer classification, including convolutional neural networks, converters, etc. Our Attention module in CANet leverages the benefits of channel attention to model the relationships between channels while encoding precise location information that captures the long-term dependencies of the network. The model achieves a high classification effect with an accuracy of 97.00%, which can be used in the automatic recognition and classification of oral cancer.