Large-scale vision-language pre-training has demonstrated strong performance across a wide range of visual and textual understanding tasks. Recently, the textual encoders of multi-modal pre-trained models have been shown to produce high-quality textual representations that often outperform purely text-based models such as BERT. In this study, we aim to leverage both the textual and visual encoders of multi-modal pre-trained models to improve language understanding. To do so, we generate an image from a textual prompt, enriching the representation of a phrase for downstream tasks. Experiments on four benchmark datasets demonstrate that the proposed visually-enhanced text representations significantly improve performance on the entity clustering task.
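The abstract outlines a pipeline: encode a phrase with the text encoder of a multi-modal model, generate an image from a prompt built around the phrase, encode that image with the visual encoder, and use the fused representation for entity clustering. Below is a minimal sketch of such a pipeline; the choice of CLIP (openai/clip-vit-base-patch32) as the encoder, Stable Diffusion as the text-to-image generator, the "a photo of ..." prompt template, mean fusion of the two modalities, and K-means clustering are all illustrative assumptions, not details taken from the paper.

# Minimal sketch (not the authors' released code): visually-enhanced phrase
# embeddings for entity clustering, under the assumptions stated above.
import torch
from transformers import CLIPModel, CLIPProcessor
from diffusers import StableDiffusionPipeline
from sklearn.cluster import KMeans

device = "cuda" if torch.cuda.is_available() else "cpu"

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# Any text-to-image checkpoint could stand in here; this one is illustrative.
generator = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

def visually_enhanced_embedding(phrase: str) -> torch.Tensor:
    """Encode a phrase with CLIP's text encoder, generate an image from a
    prompt built around the phrase, encode the image with CLIP's visual
    encoder, and average the two L2-normalized representations."""
    prompt = f"a photo of {phrase}"      # hypothetical prompt template
    image = generator(prompt).images[0]  # text-to-image generation

    text_inputs = processor(text=[phrase], return_tensors="pt", padding=True).to(device)
    image_inputs = processor(images=image, return_tensors="pt").to(device)

    with torch.no_grad():
        text_emb = clip.get_text_features(**text_inputs)
        image_emb = clip.get_image_features(**image_inputs)

    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return ((text_emb + image_emb) / 2).squeeze(0).cpu()

# Toy entity clustering over the fused phrase embeddings.
phrases = ["golden retriever", "siamese cat", "boeing 747", "airbus a380"]
embeddings = torch.stack([visually_enhanced_embedding(p) for p in phrases]).numpy()
print(dict(zip(phrases, KMeans(n_clusters=2, n_init=10).fit_predict(embeddings))))

In practice the fusion step (here a simple average of normalized text and image features) is a design choice; weighted combinations or concatenation would be equally plausible variants of the same idea.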
Citation
Hsu, T. Y., Li, C. A., Huang, C. W., & Chen, Y. N. (2023). Visually-Enhanced Phrase Understanding. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 5879–5888). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.363