Multi-Modal Dialog State Tracking for Interactive Fashion Recommendation


Abstract

Multi-modal interactive recommendation is a task in which users receive visual recommendations and express natural-language feedback about the recommended items over multiple interaction turns. However, such multi-modal dialog sequences (i.e., turns consisting of the system's visual recommendations and the user's natural-language feedback) make it challenging to correctly incorporate the users' preferences across multiple turns. Indeed, the existing formulations of interactive recommender systems, which rely on recurrent neural network-based (i.e., RNN-based) or transformer-based models, are unable to capture the multi-modal sequential dependencies between the textual feedback and the visual recommendations. To alleviate this multi-modal sequential dependency issue, we propose a novel multi-modal recurrent attention network (MMRAN) model that effectively incorporates the users' preferences over long visual dialog sequences of the users' natural-language feedback and the system's visual recommendations. Specifically, we leverage a gated recurrent network (GRN) with a feedback gate to separately process the textual and visual representations of the natural-language feedback and the visual recommendations into hidden states (i.e., representations of the past interactions) for multi-modal sequence combination. In addition, we apply a multi-head attention network (MAN) to refine the hidden states generated by the GRN and to further enhance the model's ability to perform dynamic state tracking. Following previous work, we conduct extensive experiments on the Fashion IQ Dresses, Shirts, and Tops & Tees datasets to assess the effectiveness of our proposed model, using a vision-language transformer-based user simulator as a surrogate for real human users. Our results show that our proposed MMRAN model significantly outperforms several existing state-of-the-art baseline models.
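
To make the pipeline described in the abstract more concrete, below is a minimal PyTorch sketch of the overall MMRAN flow. It is not the authors' implementation: standard nn.GRU layers stand in for the paper's GRN with a feedback gate (whose gating equations the abstract does not give), nn.MultiheadAttention stands in for the MAN refinement step, and all dimensions, the fusion layer, and the pre-extracted per-turn features are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MMRANSketch(nn.Module):
    """Illustrative sketch of the pipeline described in the abstract:
    gated recurrent processing of per-turn textual and visual features,
    followed by multi-head attention that refines the hidden states.
    Standard GRUs stand in for the paper's GRN with a feedback gate."""

    def __init__(self, txt_dim=512, img_dim=512, hidden_dim=256, num_heads=4):
        super().__init__()
        # One recurrent network per modality: the abstract describes
        # processing textual and visual representations separately.
        self.txt_gru = nn.GRU(txt_dim, hidden_dim, batch_first=True)
        self.img_gru = nn.GRU(img_dim, hidden_dim, batch_first=True)
        # Fuse the two hidden-state streams into one multi-modal sequence
        # (the "multi-modal sequence combination" step; fusion by
        # concatenation + projection is an assumption here).
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)
        # Multi-head attention network (MAN) refining the hidden states.
        self.man = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, txt_feats, img_feats):
        # txt_feats: (batch, turns, txt_dim) per-turn natural-language feedback
        # img_feats: (batch, turns, img_dim) per-turn visual recommendations
        txt_h, _ = self.txt_gru(txt_feats)   # (batch, turns, hidden_dim)
        img_h, _ = self.img_gru(img_feats)   # (batch, turns, hidden_dim)
        states = self.fuse(torch.cat([txt_h, img_h], dim=-1))
        # Self-attention over the dialog turns refines the tracked states.
        refined, _ = self.man(states, states, states)
        return refined                        # (batch, turns, hidden_dim)

# Example: 4 dialog turns of pre-extracted 512-d features for a batch of 2.
model = MMRANSketch()
out = model(torch.randn(2, 4, 512), torch.randn(2, 4, 512))
print(out.shape)  # torch.Size([2, 4, 256])
```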

Citation (APA)

Wu, Y., Macdonald, C., & Ounis, I. (2022). Multi-Modal Dialog State Tracking for Interactive Fashion Recommendation. In RecSys 2022 - Proceedings of the 16th ACM Conference on Recommender Systems (pp. 124–133). Association for Computing Machinery. https://doi.org/10.1145/3523227.3546774
