1. Negar Arabzadeh, Amin Bigdeli, and Charles LA Clarke. 2024. Adapting Standard Retrieval Benchmarks to Evaluate Generated Answers. arXiv preprint arXiv:2401.04842 (2024).
2. Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).
3. Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2023. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109 (2023).
4. Hyung Won Chung et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research (2024).
5. James Clarke and Mirella Lapata. 2010. Discourse Constraints for Document Compression. Computational Linguistics, Vol. 36, 3 (2010).