1. Amro Kamal Mohamed Abbas, Kushal Tirumala, Daniel Simig, Surya Ganguli, and Ari S. Morcos. 2023. SemDeDup: Data-efficient learning at web-scale through semantic deduplication. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models.
2. Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Schulze Buschhoff, et al. 2023. Tokenizer Choice For LLM Training: Negligible or Crucial? arXiv preprint arXiv:2310.08754 (2023).
3. Waseem AlShikh, Manhal Daaboul, Kirk Goddard, Brock Imel, Kiran Kamble, Parikshith Kulkarni, and Melisa Russak. 2023. Becoming Self-Instruct: Introducing Early Stopping Criteria for Minimal Instruct Tuning. arXiv preprint arXiv:2307.03692 (2023).
4. Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, and Thomas Hofmann. 2023. Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers. arXiv preprint arXiv:2305.15805 (2023).
5. Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, and Christopher Ré. 2023. Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. arXiv preprint arXiv:2304.09433 (2023).