We present a neural-network-based system capable of learning a multimodal representation of images and words. This representation allows for bidirectional grounding between the meaning of words and the visual attributes they represent, such as colour, size, and object name. We also present a new dataset captured specifically for this task.
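Since the abstract does not describe the architecture itself, the following is only an illustrative sketch of one common way to learn such a shared image-word space: a two-tower model trained with a symmetric contrastive loss, written in PyTorch. All module names, dimensions, and the toy data are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (assumed, not the paper's method): a two-tower model
# that maps image features and attribute words into a shared space, so
# grounding can run in both directions (word -> images, image -> words).
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 64       # size of the shared multimodal space (assumed)
VOCAB_SIZE = 100     # attribute/word vocabulary size (assumed)
IMG_FEAT_DIM = 512   # dimensionality of precomputed image features (assumed)

class ImageEncoder(nn.Module):
    """Projects precomputed image features into the shared space."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_FEAT_DIM, 256), nn.ReLU(),
            nn.Linear(256, EMBED_DIM),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class WordEncoder(nn.Module):
    """Embeds attribute words (e.g. colour, size, object name) into the shared space."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

    def forward(self, ids):
        return F.normalize(self.emb(ids), dim=-1)

def contrastive_loss(img_emb, word_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched image/word pairs score highest."""
    logits = img_emb @ word_emb.t() / temperature
    targets = torch.arange(img_emb.size(0))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    torch.manual_seed(0)
    img_enc, word_enc = ImageEncoder(), WordEncoder()
    opt = torch.optim.Adam(
        list(img_enc.parameters()) + list(word_enc.parameters()), lr=1e-3)

    # Toy batch: 8 random image feature vectors paired with 8 attribute-word ids.
    img_feats = torch.randn(8, IMG_FEAT_DIM)
    word_ids = torch.randint(0, VOCAB_SIZE, (8,))

    for step in range(5):
        loss = contrastive_loss(img_enc(img_feats), word_enc(word_ids))
        opt.zero_grad()
        loss.backward()
        opt.step()
        print(f"step {step}: loss={loss.item():.4f}")
```

After training, nearest-neighbour search in the shared space gives the bidirectional behaviour described above: a word embedding retrieves the images whose attributes match it, and an image embedding retrieves the words that describe it.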
Sheppard, E., & Lohan, K. S. (2020). Multimodal representation learning for human robot interaction. In ACM/IEEE International Conference on Human-Robot Interaction (pp. 445–446). IEEE Computer Society. https://doi.org/10.1145/3371382.3378265