MixImages: An Urban Perception AI Method Based on Polarization Multimodalities
Authors:
Mo Yan 1,2, Zhou Wanting 1, Chen Wei 3
Affiliations:
1. School of Information Engineering, Nanchang Hangkong University, Nanchang 330063, China
2. College of Aeronautics Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
3. College of Geoscience and Surveying Engineering, China University of Mining & Technology, Beijing 100083, China
Abstract
Intelligent urban perception is an active research topic. Most previous urban perception models based on semantic segmentation have used RGB images as their sole, unimodal input. However, in natural urban scenes, the interplay of light and shadow often produces ambiguous RGB features that degrade a model's perception ability. Polarization data capture information dimensions beyond RGB and can enhance the representation of shadow regions, making them a valuable auxiliary modality. In addition, transformers have achieved outstanding performance in visual tasks in recent years, and their large effective receptive field can provide more discriminative cues for shadow regions. For these reasons, this study proposes a novel semantic segmentation model, MixImages, which fuses RGB and polarization data for pixel-level perception. We conducted comprehensive experiments on a polarization dataset of urban scenes. The results show that MixImages achieves an accuracy advantage of 3.43% over the RGB-only control model in the unimodal benchmark and a performance improvement of 4.29% in the multimodal benchmark. To provide a reference for specific downstream tasks, we also tested the impact of different combinations of polarization types on overall segmentation accuracy. MixImages offers a new option for urban scene perception tasks.
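The abstract does not detail the MixImages architecture, so the following is only a minimal sketch, assuming a generic two-branch design that fuses RGB features with polarization-derived features (e.g., degree- or angle-of-linear-polarization channels) for pixel-level segmentation. All names (TwoBranchSegNet, ConvBlock), channel sizes, and the use of convolutional rather than transformer encoder blocks are illustrative assumptions, not the paper's actual design.

# Illustrative sketch only: a generic RGB + polarization fusion network for
# pixel-level segmentation. Not the MixImages architecture from the paper.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """3x3 conv -> BatchNorm -> ReLU, halving spatial resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class TwoBranchSegNet(nn.Module):
    """Hypothetical fusion model: separate encoders for the RGB and
    polarization inputs, channel-concatenation fusion, and a light
    per-pixel classification head."""
    def __init__(self, pol_channels=3, num_classes=19):
        super().__init__()
        self.rgb_enc = nn.Sequential(ConvBlock(3, 32), ConvBlock(32, 64))
        self.pol_enc = nn.Sequential(ConvBlock(pol_channels, 32), ConvBlock(32, 64))
        self.fuse = nn.Conv2d(128, 64, 1)          # fuse concatenated features
        self.head = nn.Conv2d(64, num_classes, 1)  # per-pixel class logits

    def forward(self, rgb, pol):
        feats = torch.cat([self.rgb_enc(rgb), self.pol_enc(pol)], dim=1)
        logits = self.head(self.fuse(feats))
        # Upsample back to input resolution for pixel-level prediction.
        return nn.functional.interpolate(
            logits, size=rgb.shape[-2:], mode="bilinear", align_corners=False)

if __name__ == "__main__":
    rgb = torch.randn(1, 3, 256, 256)  # RGB image
    pol = torch.randn(1, 3, 256, 256)  # e.g., DoLP/AoLP-derived channels
    print(TwoBranchSegNet()(rgb, pol).shape)  # torch.Size([1, 19, 256, 256])

In a transformer-based model such as the one the abstract describes, the convolutional encoder blocks here would be replaced by attention blocks, whose larger effective receptive field supplies the shadow-region cues discussed above.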
Funder
National Natural Science Foundation of China