In recent years, intelligent systems such as dialogue systems have become part of daily life. These systems could serve users better by associating conversations with their dialogue situations, such as where a dialogue takes place and the relationships among its participants. However, because previous studies generally assumed that systems operate in limited, specific dialogue situations, the recognition of everyday dialogue situations has been neglected. We propose a dialogue situation recognition method based on Gated Recurrent Units (GRUs) and Bidirectional Encoder Representations from Transformers (BERT) that fuses multimodal features. The target dialogue situations comprise dialogue styles, places, activities, and the relationships between participants. In our experiments, we used the Corpus of Everyday Japanese Conversation (CEJC), which records natural everyday conversations in a variety of situations. Our models with multi-task learning achieved an average F1-score of 0.541 with multimodal features, with the BERT-based approach outperforming the GRU-based method by 2.3 percentage points. We also analyzed the relationship between recognition performance and dataset size. To the best of our knowledge, this is the first study to tackle the understanding of dialogue scenes using audio, visual, and linguistic information.
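To make the fused multi-task architecture concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it concatenates a BERT text embedding with projected audio and visual features and attaches one classification head per situation label. All dimensions, label counts, the generic BERT checkpoint (a Japanese checkpoint would fit the CEJC setting in practice), and the late-fusion strategy are illustrative assumptions.

# Hedged sketch of a BERT-based multimodal multi-task classifier.
# Feature sizes and label counts are placeholders, not from the paper.
import torch
import torch.nn as nn
from transformers import BertModel

class MultimodalSituationClassifier(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512,
                 num_styles=4, num_places=10,
                 num_activities=10, num_relations=5):
        super().__init__()
        # Linguistic encoder: pretrained BERT; pooled [CLS] vector is 768-d.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Project precomputed audio/visual features into a shared space.
        self.audio_proj = nn.Linear(audio_dim, 256)
        self.visual_proj = nn.Linear(visual_dim, 256)
        fused_dim = 768 + 256 + 256
        # One head per dialogue-situation label (multi-task learning).
        self.heads = nn.ModuleDict({
            "style": nn.Linear(fused_dim, num_styles),
            "place": nn.Linear(fused_dim, num_places),
            "activity": nn.Linear(fused_dim, num_activities),
            "relation": nn.Linear(fused_dim, num_relations),
        })

    def forward(self, input_ids, attention_mask, audio_feats, visual_feats):
        text = self.bert(input_ids=input_ids,
                         attention_mask=attention_mask).pooler_output
        # Late fusion: concatenate the three modality embeddings.
        fused = torch.cat([text,
                           torch.relu(self.audio_proj(audio_feats)),
                           torch.relu(self.visual_proj(visual_feats))], dim=-1)
        return {task: head(fused) for task, head in self.heads.items()}

def multitask_loss(logits, labels):
    # Multi-task objective: sum of per-task cross-entropy losses.
    ce = nn.CrossEntropyLoss()
    return sum(ce(logits[task], labels[task]) for task in logits)

A GRU-based variant would replace the BERT encoder with a GRU over token embeddings; the fusion and per-task heads would stay the same.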
Chiba, Y., & Higashinaka, R. (2023). Dialogue Situation Recognition in Everyday Conversation from Audio, Visual, and Linguistic Information. IEEE Access, 11, 70819–70832. https://doi.org/10.1109/ACCESS.2023.3293846