How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

  • Chia-Wei Liu
  • Ryan Lowe
  • Iulian V. Serban
  • Michael Noseworthy
  • Laurent Charlin
  • Joelle Pineau

We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
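The evaluation setup the paper critiques can be illustrated with a short sketch: score each generated response against its single target response with a machine-translation metric such as BLEU, then measure how those scores correlate with human judgements. The snippet below is an illustrative sketch only, not the paper's code; the library choices (nltk, scipy) and the toy responses and ratings are assumptions made purely for demonstration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Toy data (hypothetical): one reference response per context, the model's
# generated response, and a human appropriateness rating of that response.
references = [
    "you can mount the drive with sudo mount /dev/sdb1 /mnt",
    "try reinstalling the driver from the ppa",
    "i loved that movie , the ending was great",
    "not much , just getting coffee before work",
]
generated = [
    "try running sudo fdisk -l to list your drives",
    "reinstall the driver from the ppa and reboot",
    "yeah the ending was really great",
    "the weather is nice today",
]
human_scores = [4, 5, 5, 1]  # hypothetical annotator ratings on a 1-5 scale

# Sentence-level BLEU against a single reference, with smoothing so that
# responses lacking higher-order n-gram overlap do not all collapse to zero.
smooth = SmoothingFunction().method1
bleu_scores = [
    sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)
    for ref, hyp in zip(references, generated)
]

# The question the paper studies: do metric scores track human judgements?
rho, p_value = spearmanr(bleu_scores, human_scores)
print("BLEU per response:", [round(b, 3) for b in bleu_scores])
print(f"Spearman correlation with human ratings: rho={rho:.3f}, p={p_value:.3f}")
```

In the paper's experiments, correlations of this kind were found to be very weak in the Twitter domain and absent in the Ubuntu domain, which is what motivates the recommendations for better automatic evaluation metrics.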


