Authors:
Li Yu, Zhang Shenyu, Wu Rui, Huang Xiutian, Chen Yongrui, Xu Wenhao, Qi Guilin, Min Dehai
Publisher:
Springer Nature Singapore