A Systematic Review of Synthetic Data Generation Techniques Using Generative AI

Author:

Goyal, Mandeep 1; Mahmoud, Qusay H. 1

Affiliation:

1. Department of Electrical, Computer and Software Engineering, Ontario Tech University, Oshawa, ON L1G 0C5, Canada

Abstract

Synthetic data are increasingly recognized for their potential to address serious real-world challenges across many domains. They offer a way to mitigate the data scarcity, privacy concerns, and algorithmic biases commonly encountered in machine learning applications. Synthetic data aim to preserve the underlying patterns and behaviors of the original dataset while altering the actual content. The methods proposed in the literature for generating synthetic data range from large language models (LLMs), which are pre-trained on very large datasets, to generative adversarial networks (GANs) and variational autoencoders (VAEs). This study provides a systematic review of these techniques in order to identify their limitations and suggest potential areas for future research. The findings indicate that, while these technologies can generate synthetic data of specific data types, they still have drawbacks, such as computational requirements, training instability, and limited privacy-preserving measures, which restrict their real-world usability. Addressing these issues will facilitate the broader adoption of synthetic data generation techniques across various disciplines, thereby advancing machine learning and data-driven solutions.

Publisher

MDPI AG

