Abstract
We propose an unsupervised learning algorithm for automatically inferring the mappings between English nouns and corresponding video objects. Given a sequence of natural language instructions and an unaligned video recording, we simultaneously align each instruction to its corresponding video segment, and align nouns in each instruction to their corresponding objects in the video. While existing grounded language acquisition algorithms rely on pre-aligned supervised data (each sentence paired with its corresponding image frame or video segment), our algorithm aims to infer the alignment automatically from the temporal structure of the video and the parallel text instructions. We propose two generative models that are closely related to the HMM and IBM Model 1 word alignment models used in statistical machine translation. We evaluate our algorithm on videos of biological experiments performed in wetlabs, and demonstrate its capability of aligning video segments to text instructions and matching video objects to nouns in the absence of any direct supervision.
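The abstract relates the proposed generative models to IBM Model 1 from statistical machine translation. As illustration only, here is a minimal sketch of standard IBM Model 1 EM training, treating each video segment's detected object labels as the "source" side and the instruction's nouns as the "target" side; the noun/object vocabulary and the `ibm_model1` helper are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate noun-to-object translation probabilities t(noun | object)
    with IBM Model 1 EM. `pairs` is a list of (nouns, objects) tuples,
    one per already-aligned segment (a simplifying assumption; the paper
    learns the segment alignment jointly rather than taking it as given)."""
    nouns = {n for ns, _ in pairs for n in ns}
    # Uniform initialization over the noun vocabulary.
    t = defaultdict(lambda: 1.0 / len(nouns))
    for _ in range(iterations):
        count = defaultdict(float)   # expected co-occurrence counts
        total = defaultdict(float)   # normalizer per object
        for ns, objs in pairs:
            for n in ns:
                # E-step: distribute one unit of mass for noun n
                # across the candidate objects, proportional to t.
                z = sum(t[(n, o)] for o in objs)
                for o in objs:
                    c = t[(n, o)] / z
                    count[(n, o)] += c
                    total[o] += c
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(lambda: 1.0 / len(nouns),
                        {(n, o): count[(n, o)] / total[o]
                         for (n, o) in count})
    return t

# Toy usage with made-up wetlab labels: "beaker" co-occurs with the
# BEAKER track more consistently, so EM concentrates mass there.
pairs = [(["beaker", "pipette"], ["BEAKER", "PIPETTE"]),
         (["beaker"], ["BEAKER"])]
t = ibm_model1(pairs)
```

After a few EM iterations, `t[("beaker", "BEAKER")]` exceeds `t[("pipette", "BEAKER")]`, illustrating how co-occurrence statistics alone can ground nouns in object tracks without per-sentence supervision.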
Citation
Naim, I., Song, Y. C., Liu, Q., Kautz, H., Luo, J., & Gildea, D. (2014). Unsupervised alignment of natural language instructions with video segments. In Proceedings of the National Conference on Artificial Intelligence (Vol. 2, pp. 1558–1564). AI Access Foundation. https://doi.org/10.1609/aaai.v28i1.8939