Affiliation:
1. Shenzhen University
2. Shenzhen Technology University
Abstract
Monocular depth estimation has a wide range of applications in autostereoscopic displays; however, accuracy and robustness in complex scenes remain a challenge. In this paper, we propose AMENet, a depth estimation network for autostereoscopic displays that improves the accuracy of monocular depth estimation by fusing a Vision Transformer (ViT) with a Convolutional Neural Network (CNN). Our approach feeds the input image into the ViT module as a sequence of visual tokens and exploits the module's global perception capability to extract high-level semantic features. A weight correction module quantifies the relationship between the losses, improving the model's robustness. Experimental evaluations on several public datasets show that AMENet achieves better accuracy and robustness than existing methods across different scenarios and under complex conditions. A detailed experimental analysis further verifies the effectiveness and stability of our method. In summary, AMENet is a promising approach that offers higher robustness and accuracy for monocular depth estimation tasks.
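The weight correction module mentioned above balances multiple training losses. As a minimal, hypothetical sketch (the paper's actual module may differ), one common way to quantify the relationship between losses is to weight each loss term by a learnable log-variance parameter, in the spirit of homoscedastic uncertainty weighting; the function name and the parameters `losses` and `log_vars` below are illustrative assumptions, not the authors' API:

```python
import math

def weighted_total_loss(losses, log_vars):
    """Combine per-term losses with learnable weights (illustrative sketch).

    Each loss L_i is scaled by exp(-s_i), where s_i is a learnable
    log-variance; the +s_i term keeps the weights from collapsing to zero:
        total = sum_i exp(-s_i) * L_i + s_i
    """
    return sum(math.exp(-s) * L + s for L, s in zip(losses, log_vars))

# Example: two depth-loss terms with equal initial log-variances (s_i = 0),
# so each weight exp(-s_i) is 1 and the total is simply 0.8 + 0.3.
total = weighted_total_loss([0.8, 0.3], [0.0, 0.0])
print(total)  # -> 1.1
```

During training, the `log_vars` would be optimized jointly with the network weights, letting the model down-weight noisier loss terms automatically.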
Publisher
Research Square Platform LLC