Pedestrian motion recognition is an important component of intelligent transportation systems. Because commonly used spatial-temporal features are not sufficient for mining deep information in video frames, this study proposes a three-stream neural network called the spatial-temporal-relational network (STRN), in which static spatial information, dynamic motion and the differences between adjacent keyframes are jointly considered as features of the video records. In addition, an optimised pooling layer, the convolutional vector of locally aggregated descriptors (Conv-VLAD) layer, is employed before the final classification step in each stream to better aggregate the extracted features and reduce intra-class differences. To accomplish this, the original video records are processed into RGB images, optical flow images and RGB difference images, which supply the respective input for each stream. After a classification result is obtained from each stream, a decision-level fusion mechanism combines the partial, per-stream understandings to improve the network's overall accuracy. Experimental results on two public data sets, UCF101 (94.7%) and HMDB51 (69.0%), show that the proposed method achieves significantly improved performance. These results indicate that STRN is well suited to deep-learning applications in intelligent transportation systems that aim to ensure pedestrian safety.
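For concreteness, the sketch below shows one way the three input modalities could be derived from a pair of adjacent frames, assuming OpenCV's Farneback dense optical flow. The `modalities` helper and the flow parameters are illustrative assumptions, not the authors' exact preprocessing pipeline.

```python
# Minimal sketch of deriving the three stream inputs from adjacent frames.
# Flow parameters and frame handling are assumptions for illustration.
import cv2
import numpy as np

def modalities(prev_bgr: np.ndarray, curr_bgr: np.ndarray):
    """Return (RGB image, optical-flow field, RGB difference) for one frame pair."""
    prev_rgb = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2RGB)
    curr_rgb = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2RGB)   # spatial-stream input
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    # Dense optical flow (temporal-stream input); Farneback is one common choice.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # RGB difference between adjacent frames (relational-stream input);
    # signed int16 avoids uint8 wrap-around on subtraction.
    diff = curr_rgb.astype(np.int16) - prev_rgb.astype(np.int16)
    return curr_rgb, flow, diff
```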
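The abstract describes Conv-VLAD only at a high level. The following PyTorch sketch shows a NetVLAD-style aggregation layer of the kind the name suggests; the cluster count `num_clusters`, the feature dimension `dim` and the 1x1 soft-assignment convolution are assumptions for illustration, not the paper's exact layer.

```python
# NetVLAD-style pooling sketch, assuming the layer sits on the last
# convolutional feature map of each stream. Hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvVLAD(nn.Module):
    def __init__(self, dim: int = 512, num_clusters: int = 32):
        super().__init__()
        self.num_clusters = num_clusters
        # 1x1 convolution yields per-location soft assignments to clusters.
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1)
        # Learnable cluster centres, one D-dimensional centre per cluster.
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, D, H, W) feature map.
        n, d, _, _ = x.shape
        soft = F.softmax(self.assign(x), dim=1)            # (N, K, H, W)
        soft = soft.view(n, self.num_clusters, -1)         # (N, K, H*W)
        feats = x.view(n, d, -1)                           # (N, D, H*W)
        # Soft-assignment-weighted sum of descriptors, minus the weighted
        # centroids: the classic VLAD residual aggregation.
        vlad = torch.einsum('nkl,ndl->nkd', soft, feats)   # (N, K, D)
        vlad = vlad - soft.sum(dim=-1).unsqueeze(-1) * self.centroids
        vlad = F.normalize(vlad, p=2, dim=2)               # intra-normalisation
        return F.normalize(vlad.flatten(1), p=2, dim=1)    # (N, K*D)
```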
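Decision-level fusion can be as simple as a weighted average of the per-stream class probabilities. The sketch below assumes equal default weights, which are illustrative; the abstract does not specify how the streams are weighted.

```python
# Sketch of decision-level fusion over the three stream classifiers.
# Equal weights are an assumption, not the paper's stated scheme.
import torch

def fuse_predictions(spatial_logits, temporal_logits, relational_logits,
                     weights=(1.0, 1.0, 1.0)):
    """Weighted-average the per-stream class probabilities, return argmax."""
    probs = [torch.softmax(l, dim=1) for l in
             (spatial_logits, temporal_logits, relational_logits)]
    fused = sum(w * p for w, p in zip(weights, probs)) / sum(weights)
    return fused.argmax(dim=1), fused
```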
Citation: Peng, S., Su, T., Jin, X., Kong, J., & Bai, Y. (2020). Pedestrian motion recognition via Conv-VLAD integrated spatial-temporal-relational network. IET Intelligent Transport Systems, 14(5), 392–400. https://doi.org/10.1049/iet-its.2019.0471