Surrounding-aware representation prediction in Birds-Eye-View using transformers

Published: 2023-07-04
Volume: 17
ISSN: 1662-453X
Container-title: Frontiers in Neuroscience
Short-container-title: Front. Neurosci.

Author:

Yu Jiahui, Zheng Wenli, Chen Yongquan, Zhang Yutong, Huang Rui

Abstract

Birds-Eye-View (BEV) maps provide an accurate representation of sensory cues present in the surroundings, including dynamic and static elements. Generating a semantic representation of BEV maps can be a challenging task since it relies on object detection and image segmentation. Recent studies have developed convolutional neural networks (CNNs) to tackle the underlying challenge. However, current CNN-based models encounter a bottleneck in perceiving subtle nuances of information due to their limited capacity, which constrains the efficiency and accuracy of representation prediction, especially for multi-scale and multi-class elements. To address this issue, we propose novel neural networks for BEV semantic representation prediction that are built upon Transformers without convolution layers, in a significantly different way from existing pure CNNs and hybrid architectures that merge CNNs and Transformers. Given a sequence of image frames as input, the proposed neural networks can directly output the BEV maps with per-class probabilities in end-to-end forecasting. The core innovations of the current study are (1) a new pixel generation method powered by Transformers, (2) a novel algorithm for image-to-BEV transformation, and (3) a novel network for image feature extraction using attention mechanisms. We evaluate the proposed model's performance on two challenging benchmarks, the NuScenes dataset and the Argoverse 3D dataset, and compare it with state-of-the-art methods. Results show that the proposed model outperforms CNNs, achieving a relative improvement of 2.4% and 5.2% on the NuScenes and Argoverse 3D datasets, respectively.

Funder

Shenzhen Science and Technology Innovation Program

Publisher

Frontiers Media SA

Subject

General Neuroscience

References (41 articles)

1. Layer normalization; Ba; arXiv, 2016

2. NuScenes: a multimodal dataset for autonomous driving; Caesar, 2020

3. Efficient grasp detection network with gaussian-based grasp representation for robotic manipulation; Cao

4. Neurograsp: multimodal neural network with euler region regression for neuromorphic vision-based grasp pose estimation; Cao; IEEE Trans. Instrum. Meas

5. End-to-end object detection with transformers; Carion, 2020

Cited by 6 articles.

1. Efficient multi-level cross-modal fusion and detection network for infrared and visible image;Alexandria Engineering Journal;2024-12

2. Image Signal Communication and Sensing for Traffic Key Representation Prediction;Sensors and Materials;2024-08-08

3. Multiscale Object Detection Using Adaptive Context Redetecting in Remote Sensing Systems;Sensors and Materials;2024-08-08

4. Predicting Bird's-Eye-View Semantic Representations Using Correlated Context Learning;IEEE Robotics and Automation Letters;2024-05

5. Lightweight UAV Object-Detection Method Based on Efficient Multidimensional Global Feature Adaptive Fusion and Knowledge Distillation;Electronics;2024-04-19