1. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, and Quoc Le. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
2. Yanhong Bai, Jiabao Zhao, Jinxin Shi, Tingjiang Wei, Xingjiao Wu, and Liang He. 2023. FairBench: A Four-Stage Automatic Framework for Detecting Stereotypes and Biases in Large Language Models. arXiv preprint arXiv:2308.10397.
3. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, and Greg Brockman. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
4. BOLD
5. Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. 2023. Bias and fairness in large language models: A survey. arXiv preprint arXiv:2309.00770.