Unifying Sentence Transformer Embedding and Softmax Voting Ensemble for Accurate News Category Prediction

Authors:

Saima Khosa 1,2, Arif Mehmood 1, Muhammad Rizwan 2

Affiliation:

1. Department of Information Security, The Islamia University of Bahawalpur, Bahawalpur 63100, Pakistan

2. Department of Information Technology, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan 64200, Pakistan

Abstract

The study addresses news category prediction, investigating sentence embeddings from four transformer model families (BERT, RoBERTa, MPNet, and T5) and their variants as feature vectors for Softmax regression and Random Forest classifiers on two publicly available news datasets from Kaggle. The data are split into stratified train and test sets so that each category is equally represented. Token embeddings are taken from the last hidden layer of each transformer, and mean pooling condenses them into a single sentence embedding that captures the overall meaning of the news article. Softmax regression, Random Forest, and a soft-voting ensemble of the two are evaluated with accuracy, F1 score, precision, and recall, and macro-averaged F1 scores are used to compare the different transformer embeddings under identical experimental settings. The experiments show that MPNet v1 and v3 embeddings achieve the highest F1 score, 97.7%, when combined with Random Forest, while the T5 Large embedding achieves the highest F1 score, 98.2%, with Softmax regression. MPNet v1 performs best in the voting classifier, reaching an F1 score of 98.6%. In conclusion, the experiments confirm the strength of certain transformer embeddings, such as MPNet v1, MPNet v3, and DistilRoBERTa, within the Random Forest framework, and highlight the strong performance of T5 Large and RoBERTa Large in the soft-voting ensemble of Softmax regression and Random Forest. The voting classifier, combining transformer embeddings with ensemble learning, consistently outperforms the baselines and the individual algorithms, underscoring its effectiveness for accurate and reliable news category classification.
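The two core steps of the pipeline, mean pooling over the last hidden layer and soft voting over classifier probabilities, can be sketched in a few lines. This is a minimal NumPy illustration with toy arrays, not the authors' implementation: in the study, `last_hidden` would come from a real transformer and the probability matrices from trained Softmax and Random Forest models.

```python
import numpy as np

def mean_pool(last_hidden, attention_mask):
    """Average token embeddings into one sentence embedding, ignoring padding."""
    mask = attention_mask[..., None].astype(float)   # (batch, tokens, 1)
    summed = (last_hidden * mask).sum(axis=1)        # sum over real tokens
    counts = mask.sum(axis=1)                        # number of real tokens
    return summed / counts

def soft_vote(prob_a, prob_b):
    """Average class probabilities from two classifiers; argmax picks the label."""
    avg = (prob_a + prob_b) / 2.0
    return avg.argmax(axis=1)

# Toy example: 2 articles, 4 token positions, 3-dimensional "embeddings"
last_hidden = np.arange(24, dtype=float).reshape(2, 4, 3)
attention_mask = np.array([[1, 1, 1, 0],   # article 1: 3 real tokens, 1 pad
                           [1, 1, 0, 0]])  # article 2: 2 real tokens, 2 pads
sent_emb = mean_pool(last_hidden, attention_mask)

# Hypothetical class probabilities from Softmax regression and Random Forest
p_softmax = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])
p_forest  = np.array([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]])
preds = soft_vote(p_softmax, p_forest)   # predicted category index per article
```

Masked mean pooling matters because padded positions would otherwise dilute the sentence embedding, and soft voting lets the two classifiers compensate for each other's low-confidence mistakes rather than merely overruling one another by hard majority.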

Publisher

MDPI AG

Subject

Computer Networks and Communications, Human-Computer Interaction


Cited by 1 article.

1. Application of Nepali Large Language Models to Improve Sentiment Analysis. Proceedings of the 2024 7th International Conference on Computers in Management and Business, 12 January 2024.
