CMCC: A Comprehensive and Large-Scale Human-Human Dataset for Dialogue Systems

Yi Huang; Xiaoting Wu; Si Chen; Wei Hu; Qing Zhu; Junlan Feng; Chao Deng; Zhijian Ou; Jiangjiang Zhao

Conference Proceedings (Open Access)

SereTOD 2022 - Towards Semi-Supervised and Reinforced Task-Oriented Dialog Systems, Proceedings of the Workshop (2022) 48-61

DOI: 10.18653/v1/2022.seretod-1.7

Abstract

Dialogue modeling problems severely limit the real-world deployment of neural conversational models, and building a human-like dialogue agent is an extremely challenging task. Recently, data-driven models, which need a huge amount of conversation data, have become more and more prevalent. In this paper, we release around 100,000 dialogues, which come from real-world dialogue transcripts between real users and customer-service staff. We call this dataset CMCC (China Mobile Customer Care); it differs significantly from existing dialogue datasets in both size and nature. The dataset reflects several characteristics of human-human conversations, e.g., being task-driven and care-oriented and having long-term dependencies across the context. It also covers various dialogue types, including task-oriented, chitchat, and conversational recommendation, in real-world scenarios. To our knowledge, CMCC is the largest real human-human spoken dialogue dataset, with dozens of times the data scale of others, which should significantly promote the training and evaluation of dialogue modeling methods. The results of extensive experiments indicate that CMCC is challenging and needs further effort. We hope that this resource will allow more effective models to be built across various dialogue sub-problems in the future.

Citation (APA)

Huang, Y., Wu, X., Chen, S., Hu, W., Zhu, Q., Feng, J., … Zhao, J. (2022). CMCC: A Comprehensive and Large-Scale Human-Human Dataset for Dialogue Systems. In SereTOD 2022 - Towards Semi-Supervised and Reinforced Task-Oriented Dialog Systems, Proceedings of the Workshop (pp. 48–61). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.seretod-1.7