Abstract
Real-world scenes typically have complex structure, and utterances about them consequently do as well. We devise and evaluate a model that processes descriptions of complex configurations of geometric shapes and can identify the described scenes among a set of candidates, including similar distractors. The model works with raw images of scenes and, by design, can work incrementally, word by word. Hence, it can be used in highly responsive interactive and situated settings. Using a corpus of descriptions from game-play between human subjects (who found this to be a challenging task), we show that reconstructing the structure of descriptions in our system contributes to task success and supports the performance of the word-based model of grounded semantics that we use.
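To illustrate what word-by-word incremental resolution against a set of candidate scenes can look like, here is a minimal sketch. It is not the authors' implementation. It assumes a words-as-classifiers style model (one binary classifier per word over scene features), which is only one reading of "word-based model of grounded semantics". The names `IncrementalSceneResolver`, `MODEL`, `SCENES`, the two-dimensional toy features, and the choice to combine word scores by summing log-probabilities are all hypothetical. The sketch also omits the reconstruction of description structure that the abstract credits with improving task success.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy model: word -> (weights, bias) of a logistic classifier
# over a scene feature vector [redness, roundness]. A real system would
# learn these from data and use features extracted from raw images.
MODEL: Dict[str, Tuple[List[float], float]] = {
    "red": ([4.0, 0.0], -2.0),
    "circles": ([0.0, 4.0], -2.0),
}

# Hypothetical candidate scenes (target plus similar distractors).
SCENES: Dict[str, List[float]] = {
    "A": [1.0, 0.0],  # red squares
    "B": [0.0, 1.0],  # blue circles
    "C": [1.0, 1.0],  # red circles
}


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def word_fit(weights: List[float], bias: float, features: List[float]) -> float:
    """Probability that a word applies to a scene with these features."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


class IncrementalSceneResolver:
    """Keeps a running score per candidate scene, updated after every word."""

    def __init__(self, model, scenes):
        self.model = model
        self.scenes = scenes
        # Uniform start: every candidate has accumulated log-score 0.
        self.log_scores = {name: 0.0 for name in scenes}

    def add_word(self, word: str) -> Dict[str, float]:
        """Consume one word; words without a classifier leave scores unchanged."""
        if word in self.model:
            weights, bias = self.model[word]
            for name, feats in self.scenes.items():
                self.log_scores[name] += math.log(word_fit(weights, bias, feats))
        return self.distribution()

    def distribution(self) -> Dict[str, float]:
        # Normalise accumulated log-scores into a probability distribution.
        top = max(self.log_scores.values())
        exps = {n: math.exp(s - top) for n, s in self.log_scores.items()}
        total = sum(exps.values())
        return {n: e / total for n, e in exps.items()}

    def best(self) -> str:
        return max(self.log_scores, key=self.log_scores.get)


if __name__ == "__main__":
    resolver = IncrementalSceneResolver(MODEL, SCENES)
    for w in "the red circles".split():
        dist = resolver.add_word(w)  # a decision is available after every word
        print(w, {n: round(p, 3) for n, p in dist.items()})
    print(resolver.best())  # prints C
```

Because a full distribution over candidates exists after each word, a dialogue system built this way could act (for example, commit to a choice or request clarification) before the description is complete, which is the property that makes incremental processing useful in interactive settings.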
Citation
Manuvinakurike, R., Kennington, C., DeVault, D., & Schlangen, D. (2016). Real-Time Understanding of Complex Discriminative Scene Descriptions. In SIGDIAL 2016 - 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Proceedings of the Conference (pp. 232–241). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w16-3630