Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification

Published: 2023-11-09 Volume: 15 Issue: 11 Page: 363
ISSN: 1999-5903
Container-title: Future Internet
Language: en

Author:

Skondras Panagiotis¹, Zervas Panagiotis¹, Tzimas Giannis¹

Affiliation:

1. Data and Media Laboratory, Department of Electrical and Computer Engineering, University of Peloponnese, 22100 Tripoli, Greece

Abstract

In this article, we investigate the potential of synthetic resumes as a means of rapidly generating training data and their effectiveness in data augmentation, especially in categories with sparse samples. The widespread adoption of machine learning algorithms in natural language processing (NLP) has notably streamlined the resume classification process, saving hiring organizations time and cost. However, the performance of these algorithms depends on the abundance of training data. While selecting the right model architecture is essential, it is equally crucial to have a robust, well-curated dataset. For many categories in the job market, data sparsity remains a challenge. To address it, we employed the OpenAI API to generate both structured and unstructured resumes tailored to specific criteria. These synthetically generated resumes were cleaned, preprocessed, and then used to train two distinct models: a transformer model (BERT) and a feedforward neural network (FFNN) that incorporated Universal Sentence Encoder 4 (USE4) embeddings. Both models were evaluated on the multiclass resume classification task. When trained on an augmented dataset containing 60 percent real data (from the Indeed website) and 40 percent synthetic data from ChatGPT, the transformer model achieved exceptional accuracy, while the FFNN, as expected, achieved lower accuracy. These findings highlight the value of augmenting real-world data with ChatGPT-generated synthetic resumes, especially when training data are limited. The suitability of the BERT model for such classification tasks further reinforces this conclusion.
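The 60/40 mix of real and synthetic resumes described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two sources were sampled or combined, so the function name `build_augmented_set`, its parameters, and the random-sampling strategy below are all assumptions made for illustration, not the authors' procedure.

```python
import random


def build_augmented_set(real, synthetic, real_percent=60, total=None, seed=0):
    """Mix real and synthetic (text, label) pairs at a fixed percentage.

    Hypothetical helper: the names, defaults, and uniform random sampling
    are illustrative assumptions, not taken from the paper.
    """
    if not 0 < real_percent < 100:
        raise ValueError("real_percent must be strictly between 0 and 100")

    # Largest total size that both pools can support at the requested ratio.
    max_total = min(
        len(real) * 100 // real_percent,
        len(synthetic) * 100 // (100 - real_percent),
    )
    if total is None:
        total = max_total
    if total > max_total:
        raise ValueError("not enough samples for the requested size and ratio")

    n_real = total * real_percent // 100
    n_synthetic = total - n_real

    # Sample without replacement from each pool, then shuffle the result.
    rng = random.Random(seed)
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)
    return mixed
```

Here the ratio is applied to the dataset as a whole. Since the abstract emphasizes categories with sparse samples, applying the same helper separately to each category would be an equally plausible reading.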

Publisher

MDPI AG

Subject

Computer Networks and Communications

Link

https://www.mdpi.com/1999-5903/15/11/363/pdf

References: 33 articles.

1. Ye, J., Chen, X., Xu, N., Zu, C., Shao, Z., Liu, S., Cui, Y., Zhou, Z., Gong, C., and Shen, Y. (2023). A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models. arXiv.

2. Kuchnik, M., Smith, V., and Amvrosiadis, G. (2022). Validating Large Language Models with ReLM. arXiv.

3. (2023, September 29). OpenAI API. Available online: https://bit.ly/3UOELSX.

4. White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., and Schmidt, D. (2023). A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv.

5. Strobelt et al. (2023). Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation with Large Language Models. IEEE Trans. Vis. Comput. Graph.

Cited by 2 articles.

1. Enhancing Imbalanced Sentiment Analysis: A GPT-3-Based Sentence-by-Sentence Generation Approach. Applied Sciences, 2024-01-11.

2. Unlocking the Potential: A Comprehensive Systematic Review of ChatGPT in Natural Language Processing Tasks. Computer Modeling in Engineering & Sciences, 2024.