Daily scenes in the real world are complex because of occlusion, undesired lighting conditions, and similar factors. Although humans handle such complicated environments well, they pose challenges for machine learning systems that must identify and describe a target without ambiguity. Most previous research focuses on mining discriminating features within the same category as the target object. On the other hand, as a scene becomes more complicated, humans frequently use neighboring objects as complementary information to describe the target. Motivated by this, we propose a novel Complementary Neighboring-based Attention Network (CoNAN) that explicitly utilizes the visual differences between the target object and its highly related neighbors. These highly related neighbors are selected by an attentional ranking module and serve as complementary features that highlight the discriminating aspects of the target object. The speaker module then takes the visual difference features as an additional input to generate the expression. Our qualitative and quantitative results on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate that our generated expressions outperform those of other state-of-the-art models by a clear margin.
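The pipeline described above (rank neighbors by attention, take visual differences against the top-ranked ones, and pass the result to the speaker as an extra input) can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product relevance score, the softmax weighting, the `top_k` cutoff, and the function names `complementary_difference` and `speaker_input` are all assumptions chosen for illustration, and plain Python lists stand in for learned visual features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def complementary_difference(target, neighbors, top_k=2):
    """Rank neighbors by an attention weight and aggregate visual differences.

    Assumption: relevance is a dot product between target and neighbor
    features; the abstract does not specify the scoring function.
    Returns (indices of the selected neighbors, weighted difference vector).
    """
    if not neighbors:
        return [], [0.0] * len(target)

    # Hypothetical attention score: similarity of each neighbor to the target.
    scores = [sum(t * n for t, n in zip(target, nb)) for nb in neighbors]
    weights = softmax(scores)

    # Keep only the top_k highest-weighted ("highly related") neighbors.
    ranked = sorted(range(len(neighbors)), key=lambda i: weights[i], reverse=True)
    selected = ranked[:top_k]

    # Renormalize weights over the selected neighbors, then average the
    # element-wise differences (target minus neighbor).
    norm = sum(weights[i] for i in selected)
    diff = [0.0] * len(target)
    for i in selected:
        w = weights[i] / norm
        for d in range(len(target)):
            diff[d] += w * (target[d] - neighbors[i][d])
    return selected, diff


def speaker_input(target, diff):
    """Concatenate target features with difference features (assumed fusion)."""
    return list(target) + list(diff)


target = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
selected, diff = complementary_difference(target, neighbors, top_k=2)
fused = speaker_input(target, diff)
```

In this toy setup the neighbor identical to the target contributes a zero difference, so the aggregated difference vector comes entirely from the second selected neighbor. A real system would use learned region features and a trained scoring module in place of the dot product.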
CITATION STYLE
Kim, J., Ko, H., & Wu, J. (2020). CoNAN: A Complementary Neighboring-based Attention Network for Referring Expression Generation. In COLING 2020 - 28th International Conference on Computational Linguistics, Proceedings of the Conference (pp. 1952–1962). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.coling-main.177