Natural language descriptions for human activities in video streams

Abstract

There has been continuous growth in the volume and ubiquity of video material, and it has become essential to define video semantics in order to aid the search and retrieval of this data. We present a framework that produces textual descriptions of video based on its visual semantic content. Detected action classes are rendered as verbs, participant objects are converted to noun phrases, visual properties of detected objects are rendered as adjectives, and spatial relations between objects are rendered as prepositions. Further, in cases of zero-shot action recognition, a language model is used to infer a missing verb, aided by the detection of objects and scene settings. These extracted features are converted into textual descriptions using a template-based approach. The proposed framework is evaluated on the NLDHA dataset using ROUGE scores and human judgment.
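As a rough illustration of the template-based approach the abstract describes, the sketch below assembles a sentence from hypothetical detector outputs. The data structure, template, and toy verb lookup standing in for the language-model inference are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical detector output for one clip; field names are assumptions,
# not the paper's actual data structures.
@dataclass
class ClipFeatures:
    subject: str                         # participant object, e.g. "man"
    subject_attrs: List[str] = field(default_factory=list)  # adjectives
    verb: Optional[str] = None           # action class; None in the zero-shot case
    obj: Optional[str] = None            # second participant object, if any
    obj_attrs: List[str] = field(default_factory=list)
    relation: Optional[str] = None       # spatial relation, e.g. "next to"
    landmark: Optional[str] = None       # object the relation refers to

def noun_phrase(head: str, attrs: List[str]) -> str:
    """Render a detected object and its visual properties as a noun phrase."""
    return " ".join(["a"] + attrs + [head])

def infer_verb(f: ClipFeatures) -> str:
    """Stand-in for the paper's language-model verb inference: guess a
    plausible verb from co-occurring objects. Here, a toy lookup table."""
    lookup = {"ball": "plays with", "cup": "holds"}
    return lookup.get(f.obj, "interacts with")

def describe(f: ClipFeatures) -> str:
    """Fill a fixed template: NP + verb + NP + preposition + NP."""
    verb = f.verb or infer_verb(f)       # zero-shot case: no detected action class
    parts = [noun_phrase(f.subject, f.subject_attrs), verb]
    if f.obj:
        parts.append(noun_phrase(f.obj, f.obj_attrs))
    if f.relation and f.landmark:
        parts += [f.relation, noun_phrase(f.landmark, [])]
    return " ".join(parts).capitalize() + "."

print(describe(ClipFeatures(subject="man", subject_attrs=["tall"],
                            obj="ball", relation="next to", landmark="bench")))
# -> "A tall man plays with a ball next to a bench."
```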

Citation

Al Harbi, N., & Gotoh, Y. (2017). Natural language descriptions for human activities in video streams. In INLG 2017 - 10th International Natural Language Generation Conference, Proceedings of the Conference (pp. 85–94). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w17-3512
