Single shot multi-box object detectors [13] have been recently shown to achieve state-of-the-art performance on object detection tasks. We extend the single shot detection (SSD) framework in [13] and propose a generic architecture using a deep convolution-deconvolution network. Our architecture does not rely on any pretrained network, and can be pretrained in an unsupervised manner for a given image dataset. Furthermore, we propose a novel approach to combine feature maps from both convolution and deconvolution layers to predict bounding boxes and labels with improved accuracy. Our framework, Conv-Deconv SSD (CDSSD), with its two key contributions – unsupervised pretraining and multi-layer confluence of convolution-deconvolution feature maps – results in state-of-the-art performance while utilizing significantly less number of bounding boxes and improved identification of small objects. On 300 × 300 image inputs, we achieve 80.7% mAP on VOC07 and 78.1% mAP on VOC07+12 (1.7% to 2.8% improvement over StairNet [21], DSSD [5], SSD [13]). CDSSD achieves 30.2% mAP on COCO performing at-par with R-FCN [3] and faster-R-FCN [18], while working on smaller size input images. Furthermore, CDSSD matches SSD performance while utilizing 82% of data, and reduces the prediction time per image by 10%.
CITATION STYLE
Gabale, V., & Sawant, U. (2018). CDSSD: Refreshing single shot object detection using a conv-deconv network. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10939 LNAI, pp. 309–321). Springer Verlag. https://doi.org/10.1007/978-3-319-93040-4_25
Mendeley helps you to discover research relevant for your work.