SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization

Abstract

Data scarcity has been a long-standing issue in the field of open-domain social dialogue. To quench this thirst, we present SODA: the first publicly available, million-scale, high-quality social dialogue dataset. By contextualizing social commonsense knowledge from a knowledge graph, we are able to distill an exceptionally broad spectrum of social interactions from a large language model. Human evaluation shows that conversations in SODA are more consistent, specific, and (surprisingly) natural than those in prior human-authored datasets. Using SODA, we train COSMO: a generalizable conversation model that is significantly more natural and consistent on unseen datasets than the best-performing conversation models (e.g., GODEL, BlenderBot-1, Koala, Vicuna). Experiments reveal that COSMO is sometimes even preferred to the original human-written gold responses. Additionally, our results shed light on the distinction between knowledge-enriched conversations and natural social chitchat. We make our data, models, and code public.

Citation (APA)

Kim, H., Hessel, J., Jiang, L., West, P., Lu, X., Yu, Y., … Choi, Y. (2023). SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023) (pp. 12930–12949). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.799
