Object detection in still images has been extensively investigated in recent years, but object detection in videos remains a challenging research topic. Directly applying still-image methods to videos suffers from motion blur and low resolution in video frames. Some methods utilize temporal information to boost detection accuracy, but they are usually computationally expensive because they rely on optical flow estimation. In this paper, we propose Recurrent RetinaNet, a flexible end-to-end approach for object detection in videos. A backbone network generates several feature maps, from which a feature pyramid network extracts pyramid features. Detection boxes are generated according to the shapes of the pyramid features. Two subnets composed of convolutional layers and Convolutional LSTM layers are added on top for box regression and classification. Because the boxes are generated regardless of image content, there may be an extreme foreground-background imbalance. Thus, focal loss, which has been shown to be effective for object detection in images, is employed as the loss function for the classification subnet. Experiments show that, compared to RetinaNet, the approach improves detection accuracy and avoids missing some objects in certain cases, while the model complexity remains suitable for real-time applications.
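The abstract's use of focal loss to handle the extreme foreground-background imbalance can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; it is the standard binary focal loss FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t) from the original RetinaNet paper, with the commonly used defaults alpha = 0.25 and gamma = 2, applied per anchor box.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-anchor binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p : predicted foreground probabilities in [0, 1]
    y : binary labels (1 = foreground, 0 = background)
    alpha, gamma : standard RetinaNet defaults (assumed, not from this paper's text)
    """
    p = np.clip(p, eps, 1.0 - eps)          # numerical stability for log
    p_t = np.where(y == 1, p, 1.0 - p)      # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    # (1 - p_t)**gamma down-weights easy, well-classified examples,
    # so the huge number of easy background anchors cannot dominate training.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

With gamma = 0 this reduces to alpha-weighted cross-entropy; with gamma = 2, a well-classified background anchor (e.g. p = 0.05, y = 0) contributes a loss roughly two orders of magnitude smaller than a misclassified one, which is what makes dense, content-agnostic box generation trainable.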
Citation:
Li, X., Zhao, H., & Zhang, L. (2018). Recurrent retinaNet: A video object detection model based on focal loss. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11304 LNCS, pp. 499–508). Springer Verlag. https://doi.org/10.1007/978-3-030-04212-7_44