Abstract
We present a data-driven framework for image caption generation that incorporates visual and textual features with varying degrees of spatial structure. We propose the task of domain-specific image captioning, in which many relevant visual details cannot be captured by off-the-shelf general-domain entity detectors. We extract previously written descriptions from a database and adapt them to new query images, using a joint visual and textual bag-of-words model to determine the correctness of individual words. We implement our model using a large, unlabeled dataset of women's shoe images paired with natural language descriptions (Berg et al., 2010). Using both automatic and human evaluations, we show that our captioning method effectively deletes inaccurate words from extracted captions while maintaining a high level of detail in the generated output.
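The adaptation step described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the function name, probability values, and threshold are all assumptions. It shows the general idea of filtering an extracted caption by a per-word correctness score from some joint visual/textual model.

```python
# Hypothetical sketch of caption adaptation: keep only words whose
# estimated correctness (from a joint visual/textual model) is high
# enough for the query image. Names and numbers are illustrative.

def adapt_caption(extracted_words, word_prob, threshold=0.5):
    """Delete words the model judges unlikely to be correct."""
    return [w for w in extracted_words if word_prob.get(w, 0.0) >= threshold]

# Toy probabilities standing in for real model output.
probs = {"red": 0.9, "leather": 0.8, "buckle": 0.2, "sandal": 0.95}
print(adapt_caption(["red", "leather", "buckle", "sandal"], probs))
# → ['red', 'leather', 'sandal']
```

In this sketch, "buckle" is dropped because its assumed correctness score falls below the threshold, mirroring how the paper's method deletes inaccurate words while preserving detailed ones.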
Mason, R., & Charniak, E. (2014). Domain-specific image captioning. In CoNLL 2014 - 18th Conference on Computational Natural Language Learning, Proceedings (pp. 11–20). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/w14-1602