Affiliation:
1. Shanghai University of Engineering Science
2. Sul Ross State University
Abstract
3D perception of depth and ego-motion is vital for intelligent agents and Human Computer Interaction (HCI) tasks such as robotics and autonomous driving. Several kinds of sensors can obtain 3D depth information directly. However, the commonly used Lidar sensor is expensive, and the effective range of RGB-D cameras is limited. In computer vision, 3D perception has been studied extensively. While traditional geometric algorithms rely on many hand-crafted features for depth estimation, Deep Learning methods have achieved great success in this field. In this work, we propose a novel self-supervised method, referred to as ViT-Depth, that combines a Vision Transformer (ViT) with a Convolutional Neural Network (CNN) architecture. The image reconstruction losses computed from the estimated depth and motion between adjacent frames serve as the supervision signal, which establishes a self-supervised learning pipeline. This is an effective solution for tasks that need accurate and low-cost 3D perception, such as autonomous driving, robotic navigation, and 3D reconstruction. Our method leverages both the ability of the CNN to extract deep features and the ability of the Transformer to capture global contextual information. In addition, we propose a cross-frame loss that constrains photometric error and scale consistency across multiple frames, which makes the training process more stable and improves performance. Extensive experimental results on an autonomous driving dataset demonstrate that the proposed approach is competitive with state-of-the-art depth and motion estimation methods.
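To make the supervision signal concrete, the following is a minimal NumPy sketch of an image reconstruction (photometric) loss between two adjacent frames. It is not the authors' implementation: the function names `reproject`, `bilinear_sample`, and `photometric_loss` are hypothetical, the images are assumed to be grayscale, the camera is assumed to be a pinhole model with known intrinsics `K`, and a plain L1 error stands in for whatever loss terms the paper actually uses. The proposed cross-frame loss is not reproduced here.

```python
import numpy as np

def reproject(depth, K, T):
    """Map every target pixel to float pixel coordinates in the source frame.

    depth: (H, W) depth of the target frame (assumed predicted by a network).
    K:     (3, 3) pinhole intrinsics (assumed known).
    T:     (4, 4) rigid transform from target camera to source camera.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)  # back-project to 3D
    cam_src = T[:3, :3] @ cam + T[:3, 3:4]                 # move into source frame
    proj = K @ cam_src                                     # project onto source image
    z = np.clip(proj[2], 1e-6, None)                       # avoid division by zero
    return (proj[0] / z).reshape(h, w), (proj[1] / z).reshape(h, w)

def bilinear_sample(img, x, y):
    """Sample a grayscale image at float coordinates; also return a validity mask."""
    h, w = img.shape
    valid = (x >= 0) & (x <= w - 1) & (y >= 0) & (y <= h - 1)
    x = np.clip(x, 0, w - 1)
    y = np.clip(y, 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    out = (img[y0, x0] * (1 - wx) * (1 - wy) + img[y0, x1] * wx * (1 - wy)
           + img[y1, x0] * (1 - wx) * wy + img[y1, x1] * wx * wy)
    return out, valid

def photometric_loss(target, source, depth, K, T):
    """Mean L1 error between the target frame and the source frame warped into it."""
    x, y = reproject(depth, K, T)
    warped, valid = bilinear_sample(source, x, y)
    if not valid.any():
        return 0.0
    return float(np.abs(warped - target)[valid].mean())

# Toy usage: a horizontal intensity ramp seen from two cameras 1 unit apart.
H, W = 4, 6
K = np.array([[10.0, 0.0, 2.5], [0.0, 10.0, 1.5], [0.0, 0.0, 1.0]])
depth = np.full((H, W), 10.0)
source = np.tile(np.arange(W, dtype=float), (H, 1))
target = source + 1.0
T_true = np.eye(4)
T_true[0, 3] = 1.0  # with this depth and focal length, a one-pixel shift
loss_true = photometric_loss(target, source, depth, K, T_true)
loss_wrong = photometric_loss(target, source, depth, K, np.eye(4))
```

In this toy setup the loss is lower for the correct relative pose than for the identity pose, which is the property a self-supervised pipeline of this kind relies on: depth and motion estimates that explain the adjacent frame well yield a small reconstruction error, so no ground-truth depth is needed.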
Funder
National Natural Science Foundation of China
Shanghai Local Capacity Enhancement
Science and Technology Innovation Action Plan
Shanghai Science and Technology Commission
Chenguang talented program of Shanghai
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications, Hardware and Architecture
Cited by
3 articles.