Dialogue modeling problems severely limit the real-world deployment of neural conversational models, and building a human-like dialogue agent is an extremely challenging task. Data-driven models, which require a huge amount of conversation data, have recently become more and more prevalent. In this paper, we release around 100,000 dialogues drawn from real-world transcripts of conversations between real users and customer-service staff. We call this dataset CMCC (China Mobile Customer Care); it differs significantly from existing dialogue datasets in both size and nature. The dataset reflects several characteristics of human-human conversations, e.g., they are task-driven and care-oriented and exhibit long-term dependencies across the context. It also covers various dialogue types in real-world scenarios, including task-oriented dialogue, chitchat, and conversational recommendation. To our knowledge, CMCC is the largest real human-human spoken dialogue dataset, dozens of times larger than others, and it should significantly promote the training and evaluation of dialogue modeling methods. The results of extensive experiments indicate that CMCC is challenging and that further effort is needed. We hope that this resource will allow more effective models to be built across various dialogue sub-problems in the future.
Huang, Y., Wu, X., Chen, S., Hu, W., Zhu, Q., Feng, J., … Zhao, J. (2022). CMCC: A Comprehensive and Large-Scale Human-Human Dataset for Dialogue Systems. In SereTOD 2022 - Towards Semi-Supervised and Reinforced Task-Oriented Dialog Systems, Proceedings of the Workshop (pp. 48–61). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.seretod-1.7