What the appearance channel from two-stream architectures for activity recognition is learning?

Abstract

The automatic recognition of human activities from video data is currently led by spatio-temporal Convolutional Neural Networks (3D CNNs), in particular two-stream architectures such as I3D, which reports the best accuracy so far. Despite the high accuracy of these architectures, very little is known about what they actually learn from data, which results in a lack of robustness and explainability. In this work we select the appearance channel of the I3D architecture and design a set of experiments aimed at explaining what this model is learning. Through these experiments we provide evidence that this particular model learns the texture of the largest area (which can be the activity or the background, depending on the distance from the camera to the action performed). In addition, we state several considerations to take into account when selecting training data in order to achieve better generalization of the model for human activity recognition.

Cite

Oves García, R., & Sucar, L. E. (2020). What the appearance channel from two-stream architectures for activity recognition is learning? In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12088 LNCS, pp. 251–260). Springer. https://doi.org/10.1007/978-3-030-49076-8_24
