Business text classification with imbalanced data and moderately large label spaces for digital transformation

Published:2024-04-30 Issue:1 Volume:9
ISSN:2364-8228
Container-title:Applied Network Science
language:en
Short-container-title:Appl Netw Sci

Author:

Arslan Muhammad, Cruz Christophe

Abstract

Digital transformation refers to an organization's use of digital technology to improve its products, services, and operations, aligning them with evolving business requirements. To demonstrate this process, we present a real-life case study in which a company seeks to automate the classification of its textual data rather than rely on manual methods. Automated classification involves deploying machine learning models, which are trained on pre-labeled datasets and then make predictions on new data. However, the dataset we received from the company posed two challenges: an imbalanced distribution of labels and a moderately large label space. To tackle text classification with such a business dataset, we evaluated four methods for multi-label text classification: fine-tuned Bidirectional Encoder Representations from Transformers (BERT), Binary Relevance, Classifier Chains, and Label Powerset. The results revealed that fine-tuned BERT significantly outperformed the other methods across key metrics such as Accuracy, F1-score, Precision, and Recall. Binary Relevance also handled the dataset effectively, while Classifier Chains and Label Powerset performed comparatively worse. These findings highlight the effectiveness of the fine-tuned BERT model and the Binary Relevance classifier in multi-label text classification tasks, particularly with imbalanced training datasets and moderately large label spaces. This positions them as valuable assets for businesses aiming to automate data classification in the digital transformation era.
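The three problem-transformation methods named in the abstract differ mainly in how they turn a multi-label target into something a standard classifier can learn. The sketch below illustrates only that reshaping step on a toy dataset. The documents, label names, and function names are all hypothetical, the fine-tuned BERT model is not covered, and none of this is taken from the authors' implementation.

```python
from collections import Counter

# Hypothetical toy data: each document carries a set of business labels.
docs = [
    {"text": "quarterly revenue rises after merger", "labels": {"finance", "mergers"}},
    {"text": "new factory opens in Lyon", "labels": {"operations"}},
    {"text": "merger talks stall over pricing", "labels": {"mergers"}},
    {"text": "revenue forecast cut, plant closed", "labels": {"finance", "operations"}},
    {"text": "bank reports record profit", "labels": {"finance"}},
]
label_names = sorted({lab for d in docs for lab in d["labels"]})


def binary_relevance_targets(docs, label_names):
    """Binary Relevance: one independent 0/1 target vector per label."""
    return {lab: [int(lab in d["labels"]) for d in docs] for lab in label_names}


def label_powerset_targets(docs):
    """Label Powerset: each distinct label combination becomes one class."""
    combos = [tuple(sorted(d["labels"])) for d in docs]
    class_ids = {combo: i for i, combo in enumerate(sorted(set(combos)))}
    return [class_ids[c] for c in combos], class_ids


def classifier_chain_steps(docs, label_order):
    """Classifier Chains: the k-th classifier also sees the k earlier labels.

    At training time the extra inputs are the true values of the labels
    that come earlier in the chain; the text features are omitted here.
    """
    steps = []
    for k, lab in enumerate(label_order):
        extra = [[int(prev in d["labels"]) for prev in label_order[:k]] for d in docs]
        target = [int(lab in d["labels"]) for d in docs]
        steps.append((lab, extra, target))
    return steps


br = binary_relevance_targets(docs, label_names)
lp_targets, lp_classes = label_powerset_targets(docs)
chain = classifier_chain_steps(docs, label_names)

# Label frequencies give a quick view of how imbalanced the labels are.
label_counts = Counter(lab for d in docs for lab in d["labels"])
```

In this toy example, Label Powerset produces five classes for five documents, which hints at why it can struggle as the label space grows: many combinations are seen only once. Binary Relevance keeps one target per label regardless of how labels co-occur.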

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s41109-024-00623-5.pdf

Reference: 30 articles.

1. Arslan M, Cruz C (2022) Semantic taxonomy enrichment to improve business text classification for dynamic environments. In: 2022 International conference on innovations in intelligent systems and applications (INISTA), IEEE, pp. 1–6. https://doi.org/10.1109/INISTA55318.2022.9894173

2. Arslan M, Cruz C (2023a) Imbalanced multi-label classification for business-related text with moderately large label spaces. arXiv preprint http://arxiv.org/abs/2306.07046

3. Arslan M, Cruz C (2023b) Enabling digital transformation through business text classification with small datasets. In: 2023 15th international conference on innovations in information technology (IIT), IEEE, pp. 38–42. https://doi.org/10.1109/IIT59782.2023.10366487

4. Bogatinovski J, Todorovski L, Džeroski S, Kocev D (2022) Comprehensive comparative study of multi-label classification methods. Expert Syst Appl 203:117215

5. Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint http://arxiv.org/abs/1810.04805