[1] Walker, M.A., Litman, D.J., Kamm, C.A. and Abella, A.: PARADISE: A Framework for Evaluating Spoken Dialogue Agents, Proc. ACL 1997, pp.271-280 (1997).
[2] Galley, M., Brockett, C., Sordoni, A., Ji, Y., Auli, M., Quirk, C., Mitchell, M., Gao, J. and Dolan, B.: ΔBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets, Proc. ACL 2015, pp.445-450 (2015).
[3] Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J.: BLEU: A Method for Automatic Evaluation of Machine Translation, Proc. ACL 2002, pp.311-318 (2002).
[4] Higashinaka, R., Funakoshi, K., Inaba, M., Tsunomori, Y., Takahashi, T. and Kaji, N.: Overview of Dialogue Breakdown Detection Challenge 3, Proc. DSTC6 (2017).
[5] Shang, L., Sakai, T., Li, H., Higashinaka, R., Miyao, Y., Arase, Y. and Nomoto, M.: Overview of the NTCIR-13 Short Text Conversation Task, Proc. NTCIR-13 (2017).