Abstract
Real-time depth estimation is crucial to many vision-related tasks, including autonomous driving, 3D reconstruction, and SLAM. In recent years, many methods have been proposed to estimate depth maps from images using different modality setups such as monocular vision, binocular vision, or sensor fusion. However, complex methods are unsuitable for real-time deployment on edge devices due to latency constraints and limited computational capacity. For edge implementation, models should be simple, small in size, and hardware-friendly. Considering these factors, we implemented MiDaSNet, which operates on the simplest setup, monocular vision, and uses a hardware-friendly CNN-based architecture, for real-time depth estimation on the edge. Moreover, because the model is trained on diverse datasets, it performs consistently across different domains. For edge deployment, we quantized the model weights to an 8-bit fixed-point representation. We then deployed the quantized model on an inexpensive FPGA board, the Kria KV260, using the predefined deep-learning processing unit embedded in the programmable logic. The results show that our quantized model achieves 82.6% zero-shot accuracy on the NYUv2 dataset at an inference speed of 50.7 fps on the board.