Datasets for Large Language Models: A Comprehensive Survey

Authors:

Liu Yang¹, Cao Jiahuan¹, Liu Chongyu¹, Ding Kai², Jin Lianwen¹

Affiliation:

1. South China University of Technology

2. INTSIG Information Co., Ltd

Abstract

This paper explores Large Language Model (LLM) datasets, which play a crucial role in the remarkable advancements of LLMs. These datasets serve as the foundational infrastructure, analogous to a root system that sustains and nurtures the development of LLMs. Consequently, the examination of these datasets emerges as a critical research topic. To address the current lack of a comprehensive overview and thorough analysis of LLM datasets, and to gain insights into their current status and future trends, this survey consolidates and categorizes the fundamental aspects of LLM datasets from five perspectives: (1) Pre-training Corpora; (2) Instruction Fine-tuning Datasets; (3) Preference Datasets; (4) Evaluation Datasets; (5) Traditional Natural Language Processing (NLP) Datasets. The survey sheds light on the prevailing challenges and points out potential avenues for future investigation. Additionally, a comprehensive review of the existing dataset resources is provided, covering statistics from 444 datasets across 8 language categories and 32 domains, with information recorded along 20 dimensions. The total data size surveyed exceeds 774.5 TB for pre-training corpora and 700M instances for the other datasets. We aim to present the entire landscape of LLM text datasets, serving as a comprehensive reference for researchers in this field and contributing to future studies. Related resources are available at: \href{https://github.com/lmmlzn/Awesome-LLMs-Datasets}{https://github.com/lmmlzn/Awesome-LLMs-Datasets}.

Publisher

Research Square Platform LLC
