Abstract
In this paper, we investigate whether fusing depth information with standard RGB data for camera-based object detection can improve the performance of current state-of-the-art single-shot detection networks. Depth information is easily acquired using depth cameras such as the Kinect or stereo setups. We investigate the optimal way to perform this sensor fusion, with a special focus on lightweight single-pass convolutional neural network (CNN) architectures that enable real-time processing on limited hardware. For this, we implement a network architecture that allows us to parameterize the network layer at which both information sources are fused together. We perform exhaustive experiments to determine the optimal fusion point in the network, from which we conclude that fusing in the mid-to-late layers provides the best results. Our best fusion models significantly outperform the baseline RGB network in both accuracy and localization of the detections.
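To make the parameterized fusion point concrete, below is a minimal sketch (not the authors' code) of a two-branch RGB-D backbone in PyTorch, where an illustrative `fusion_layer` index controls the stage at which the depth branch is concatenated into the RGB branch; all class and parameter names are assumptions for illustration.

```python
# A minimal sketch of a parameterized RGB-D fusion backbone, assuming PyTorch.
# The index `fusion_layer` controls where the depth branch is merged into the
# RGB branch; all names here are illustrative, not the paper's actual code.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One downsampling stage: conv -> batch norm -> ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FusionBackbone(nn.Module):
    def __init__(self, fusion_layer: int, channels=(32, 64, 128, 256)):
        super().__init__()
        assert 0 <= fusion_layer < len(channels)
        self.fusion_layer = fusion_layer
        # RGB and depth branches are identical up to the fusion point.
        rgb_in, depth_in = 3, 1
        self.rgb_stages = nn.ModuleList()
        self.depth_stages = nn.ModuleList()
        for i, out_ch in enumerate(channels):
            rgb_prev = rgb_in if i == 0 else channels[i - 1]
            self.rgb_stages.append(conv_block(rgb_prev, out_ch))
            if i <= fusion_layer:
                depth_prev = depth_in if i == 0 else channels[i - 1]
                self.depth_stages.append(conv_block(depth_prev, out_ch))
        # 1x1 conv projects the concatenated feature maps back to out_ch.
        self.merge = nn.Conv2d(channels[fusion_layer] * 2,
                               channels[fusion_layer], kernel_size=1)

    def forward(self, rgb, depth):
        x, d = rgb, depth
        for i, stage in enumerate(self.rgb_stages):
            x = stage(x)
            if i <= self.fusion_layer:
                d = self.depth_stages[i](d)
            if i == self.fusion_layer:
                # Fuse by channel concatenation plus a 1x1 projection.
                x = self.merge(torch.cat([x, d], dim=1))
        return x  # feature map to be fed to a single-shot detection head

# Example: fuse at the third stage (a "mid" fusion point).
model = FusionBackbone(fusion_layer=2)
out = model(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256))
```

Sweeping `fusion_layer` over the available stages reproduces the kind of early-versus-mid-versus-late fusion comparison the abstract describes, while keeping the network single-pass and lightweight.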
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry