Abstract
Deep learning techniques can be effective in helping doctors diagnose gastrointestinal polyps. However, when polyp detection processes video frame sequences containing large amounts of spurious noise, recall and mean average precision are reduced. Mean average precision is also low when the polyp targets in the video frames exhibit large scale variability. We therefore propose a Vision Transformer-based tiny-polyp detector for endoscopic video frames, named TPolyp. The proposed method uses a cross-stage Swin Transformer as a multi-scale feature extractor to extract deep feature representations of the data, improves the bidirectional-sampling feature pyramid, and integrates prediction heads with multiple channel self-attention mechanisms. Compared with convolutional neural networks, this approach focuses more on the feature information relevant to the tiny-object detection task and retains relatively deeper semantic information, improving feature expressiveness and discriminability without increasing computational complexity. Experimental results show that TPolyp improves detection accuracy by 7%, recall by 7.3%, and mean average precision by 7.5% compared with the YOLOv5 model, and detects tiny objects better in scenarios with blurry artifacts.
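The architecture the abstract describes pairs a multi-scale Transformer backbone with a bidirectional feature pyramid and attention-augmented prediction heads. The sketch below illustrates that arrangement in PyTorch under stated assumptions: a BiFPN-style weighted fusion stands in for the paper's bidirectional-sampling pyramid, a squeeze-style channel attention stands in for its channel self-attention heads, and random tensors stand in for the cross-stage Swin Transformer features. All class and parameter names here are hypothetical, not taken from the paper's code.

```python
# Minimal sketch of a TPolyp-like pipeline, assuming a multi-scale backbone,
# a BiFPN-style neck, and channel-attention prediction heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    """Squeeze-style channel attention; a stand-in for the paper's
    channel self-attention component in the prediction heads."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(x.mean(dim=(2, 3)))     # (B, C) per-channel weights
        return x * w[:, :, None, None]      # reweight feature channels

class BiDirectionalFPN(nn.Module):
    """Weighted top-down + bottom-up fusion over three scales, in the
    spirit of a BiFPN neck (weights shared across nodes for brevity)."""
    def __init__(self, channels: int):
        super().__init__()
        self.w_td = nn.Parameter(torch.ones(2))   # top-down fusion weights
        self.w_bu = nn.Parameter(torch.ones(2))   # bottom-up fusion weights
        self.conv = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(4))

    def fuse(self, a, b, w, conv):
        w = F.relu(w)
        w = w / (w.sum() + 1e-4)            # normalized, bounded contributions
        return conv(w[0] * a + w[1] * b)

    def forward(self, p3, p4, p5):
        # Top-down pass: propagate coarse semantics to finer scales.
        p4_td = self.fuse(p4, F.interpolate(p5, size=p4.shape[-2:]),
                          self.w_td, self.conv[0])
        p3_out = self.fuse(p3, F.interpolate(p4_td, size=p3.shape[-2:]),
                           self.w_td, self.conv[1])
        # Bottom-up pass: push fine localization cues back up.
        p4_out = self.fuse(p4_td, F.max_pool2d(p3_out, 2),
                           self.w_bu, self.conv[2])
        p5_out = self.fuse(p5, F.max_pool2d(p4_out, 2),
                           self.w_bu, self.conv[3])
        return p3_out, p4_out, p5_out

class AttnHead(nn.Module):
    """Prediction head that applies channel attention before a 1x1 conv."""
    def __init__(self, channels: int, num_outputs: int):
        super().__init__()
        self.attn = ChannelSelfAttention(channels)
        self.pred = nn.Conv2d(channels, num_outputs, 1)

    def forward(self, x):
        return self.pred(self.attn(x))

# Toy multi-scale features standing in for cross-stage Swin outputs.
p3, p4, p5 = (torch.randn(1, 64, s, s) for s in (64, 32, 16))
neck = BiDirectionalFPN(64)
head = AttnHead(64, num_outputs=6)          # e.g. 4 box + 1 obj + 1 class
outs = [head(f) for f in neck(p3, p4, p5)]
print([o.shape for o in outs])
```

The normalized scalar weights in the fusion step are the usual BiFPN design choice: each scale's contribution stays bounded while the network learns how much top-down semantics versus bottom-up localization each level should receive, which is where a tiny-object detector benefits most.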
Funder
Hangzhou Science and Technology Bureau
Publisher
Springer Science and Business Media LLC
Cited by
1 article.