Connectionist temporal modeling of video and language: A joint model for translation and sign labeling


Abstract

Online sign interpretation is challenged by hybrid semantics learning across sequential variations of visual representations, sign linguistics, and textual grammar. This paper proposes a Connectionist Temporal Modeling (CTM) network for sentence translation and sign labeling. To capture short-term temporal correlations, a Temporal Convolution Pyramid (TCP) module is applied to 2D CNN features to produce (2D+1D) pseudo-3D CNN features. CTM aligns these pseudo-3D features with the original 3D CNN clip features and fuses them. Next, we implement a connectionist decoding scheme for long-term sequential learning. Here, we embed dynamic programming into the decoding scheme, which directly learns the temporal mapping among features, sign labels, and the generated sentence. The dynamic-programming solution to sign labeling is treated as pseudo labels. Finally, we utilize these pseudo supervision cues in an end-to-end framework. A joint objective function measures feature correlation, entropy regularization on sign labeling, and probability maximization on sentence decoding. Experimental results on the RWTH-PHOENIX-Weather and USTC-CSL datasets demonstrate the effectiveness of the proposed approach.
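The connectionist decoding scheme described above embeds dynamic programming to align per-frame features with sign labels. The abstract does not give the exact recursion, but the standard CTC forward (alpha) algorithm is the canonical dynamic program for this kind of frame-to-label alignment; the NumPy sketch below illustrates that general technique (function names and shapes are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over a few scalars."""
    m = max(xs)
    if m == -np.inf:
        return -np.inf
    return m + np.log(sum(np.exp(x - m) for x in xs))

def ctc_log_likelihood(log_probs, target, blank=0):
    """log P(target | frame features) via the CTC forward recursion.

    log_probs: (T, C) per-frame log-probabilities over C sign classes,
               with class `blank` reserved for the CTC blank.
    target:    list of sign-label indices (the sentence's gloss sequence).

    Blanks are interleaved with the labels so the dynamic program can
    let a label span several frames or be separated from its neighbor.
    This is the standard CTC algorithm, shown only as a sketch of how
    DP maps frames to sign labels; the paper's scheme may differ.
    """
    T = log_probs.shape[0]
    ext = [blank]
    for c in target:
        ext += [c, blank]              # e.g. [a, b] -> [-, a, -, b, -]
    S = len(ext)

    alpha = np.full((T, S), -np.inf)   # alpha[t, s] = log-prob of prefixes
    alpha[0, 0] = log_probs[0, blank]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]              # stay on same symbol
            if s > 0:
                terms.append(alpha[t - 1, s - 1])  # advance one symbol
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])  # skip a blank
            alpha[t, s] = logsumexp(*terms) + log_probs[t, ext[s]]

    # Valid alignments end on the last label or the trailing blank.
    return logsumexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])

# Sanity check: with T=2 uniform frames over 3 classes and target [1],
# exactly 3 of the 9 frame paths collapse to [1], so P = 1/3.
lp = np.log(np.full((2, 3), 1.0 / 3.0))
print(ctc_log_likelihood(lp, [1]))     # close to log(1/3)
```

The per-frame argmax of the resulting alignment lattice is what a CTC-style scheme would expose as pseudo sign labels for the entropy-regularization term mentioned in the objective.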

CITATION STYLE

APA

Guo, D., Tang, S., & Wang, M. (2019). Connectionist temporal modeling of video and language: A joint model for translation and sign labeling. In IJCAI International Joint Conference on Artificial Intelligence (Vol. 2019-August, pp. 751–757). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2019/106
