Affiliation:
1. Department of Mathematics and Computer Science, University of Cagliari, 09124 Cagliari, Italy
2. Department of Computer Science and Engineering, University of Bologna, 40126 Bologna, Italy
3. Joint Research Centre (DG JRC), European Commission, 1050 Brussels, Belgium
Abstract
Generating synthetic data is a complex task that requires accurately replicating the statistical and mathematical properties of the original data. In sectors such as finance, using and sharing real data for research or model development can pose substantial privacy risks because the data contain sensitive information. Authentic data may also be scarce, particularly in specialized domains where acquiring ample, varied, and high-quality data is difficult or costly, and this scarcity can limit the training and testing of machine-learning models. In this paper, we address this challenge: our task is to synthesize a dataset with properties similar to those of an input dataset about the stock market. The input dataset is anonymized, consists of very few columns and rows, contains many inconsistencies such as missing rows and duplicates, and its values are not normalized, scaled, or balanced. We explore the use of generative adversarial networks, a deep-learning technique, to generate synthetic data and evaluate its quality against the input stock dataset. Our contribution is to generate artificial datasets that mimic the statistical properties of the input data without revealing complete information. For example, synthetic datasets can capture the distribution of stock prices, trading volumes, and market trends observed in the original dataset. The generated datasets cover a wider range of scenarios and variations, enabling researchers and practitioners to explore different market conditions and investment strategies. This diversity can enhance the robustness and generalization of machine-learning models. We evaluate the synthetic data by comparing their means, similarities, and correlations with those of the input dataset.
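The abstract names means and correlations as evaluation criteria but does not say how they are computed. As a minimal sketch, assuming both datasets are numeric arrays with the same columns, one plausible comparison looks at the gap between per-column means and the largest difference between the two Pearson correlation matrices. The function name `compare_datasets`, the toy two-column data, and the choice of Pearson correlation are illustrative assumptions, not the authors' procedure.

```python
import numpy as np


def compare_datasets(real, synthetic):
    """Compare two 2-D arrays (rows = observations, columns = features).

    Hypothetical helper for illustration only; not taken from the paper.
    """
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)

    # Absolute gap between the per-column means of the two datasets.
    mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0))

    # Pearson correlation matrices between columns (rowvar=False: columns are variables).
    corr_real = np.corrcoef(real, rowvar=False)
    corr_syn = np.corrcoef(synthetic, rowvar=False)

    # Largest absolute entry-wise difference between the correlation matrices.
    corr_gap = float(np.abs(corr_real - corr_syn).max())

    return {"mean_gap": mean_gap, "corr_gap": corr_gap}


# Toy stand-ins for a "real" and a "synthetic" dataset with two correlated columns
# (for example, a price-like and a volume-like feature). These are assumed values.
rng = np.random.default_rng(0)
cov = [[1.0, 0.6], [0.6, 1.0]]
real = rng.multivariate_normal([100.0, 5.0], cov, size=2000)
synthetic = rng.multivariate_normal([100.0, 5.0], cov, size=2000)

report = compare_datasets(real, synthetic)
print(report)
```

Small values for both gaps suggest that the synthetic data preserve the first-order statistics and the linear dependence structure of the input; they say nothing about privacy or higher-order properties, which would need separate checks.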