Affiliation:
1. School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China
2. Hubei Luojia Laboratory, Wuhan, China
3. Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
Abstract
Its precise geometric representation and its ability to handle long-tail targets have drawn increasing attention to vision-centric 3D occupancy prediction, which models the real world as a voxel-wise representation solely from visual inputs. Despite notable achievements in this field, many prior or concurrent approaches simply adapt existing spatial cross-attention (SCA) as their 2D-3D transformation module, which may cause information coupling or compromise the global receptive field along the height dimension. To overcome these limitations, we propose a hierarchical occupancy (HierOcc) network featuring our innovative height-aware cross-attention (HACA) and hierarchical self-attention (HSA) as its core modules, achieving enhanced precision and completeness in 3D occupancy prediction. The former module performs the 2D-3D transformation, while the latter promotes intercommunication among voxels. The key insight behind both modules is our multi-height attention mechanism, which binds each attention head explicitly to a specific height, thereby decoupling height information while maintaining global attention across the height dimension. Extensive experiments show that our method brings significant improvements over the baseline and surpasses all concurrent methods, demonstrating its superiority.
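To make the multi-height attention idea concrete, the PyTorch module below is a minimal, illustrative sketch in which each attention head is bound to one height level of the voxel grid, so height information stays decoupled while attention still spans every height. The class name, tensor shapes, and projection layout are assumptions for illustration and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiHeightAttention(nn.Module):
    """Illustrative multi-height attention: one attention head per height
    level. Channel groups are split so that head h carries the features
    assigned to height level h (an assumed layout, not the paper's code)."""

    def __init__(self, dim: int, num_heights: int):
        super().__init__()
        assert dim % num_heights == 0, "dim must divide evenly across heights"
        self.num_heights = num_heights        # one head per height level
        self.head_dim = dim // num_heights
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # queries: (B, Nq, dim) voxel/BEV queries; context: (B, Nk, dim)
        # image features (cross-attention) or the queries themselves
        # (self-attention).
        B, Nq, _ = queries.shape
        Nk = context.shape[1]
        # Split channels so head h holds the slice for height level h.
        q = self.q_proj(queries).view(B, Nq, self.num_heights, self.head_dim)
        k = self.k_proj(context).view(B, Nk, self.num_heights, self.head_dim)
        v = self.v_proj(context).view(B, Nk, self.num_heights, self.head_dim)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (B, H, N, d)
        # Per-height scaled dot-product attention: each head attends
        # globally over the context, but only within its own height slice.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = attn.softmax(dim=-1) @ v                    # (B, H, Nq, d)
        # Re-assemble the per-height outputs into one feature per query.
        out = out.transpose(1, 2).reshape(B, Nq, -1)
        return self.out_proj(out)
```

Under this reading, "decoupling" means the query/key/value channels for different heights never mix inside a head, while "global attention across the height dimension" comes from concatenating all per-height heads before the output projection.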