Unifying Sentence Transformer Embedding and Softmax Voting Ensemble for Accurate News Category Prediction

Authors:

Saima Khosa 1,2, Arif Mehmood 1, Muhammad Rizwan 2

Affiliation:

1. Department of Information Security, The Islamia University of Bahawalpur, Bahawalpur 63100, Pakistan

2. Department of Information Technology, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan 64200, Pakistan

Abstract

The study addresses news category prediction, investigating sentence embeddings from four transformer model families (BERT, RoBERTa, MPNet, and T5) and their variants as feature vectors for Softmax regression and Random Forest classifiers on two publicly available news datasets from Kaggle. The data are split into stratified train and test sets so that each category is equally represented. Token embeddings are taken from the last hidden layer of each transformer, and mean pooling condenses them into a single sentence embedding that captures the overall meaning of the news article. Softmax regression, Random Forest, and a soft-voting ensemble of the two are evaluated with accuracy, F1 score, precision, and recall, and macro-averaged F1 scores are used to compare the different transformer embeddings under identical experimental settings. The experiments show that MPNet v1 and v3 embeddings achieve the highest F1 score, 97.7%, when combined with Random Forest, while the T5 Large embedding achieves the highest F1 score, 98.2%, with Softmax regression. MPNet v1 performs best in the voting classifier, reaching an F1 score of 98.6%. In conclusion, the experiments confirm the strength of certain transformer embeddings, such as MPNet v1, MPNet v3, and DistilRoBERTa, within the Random Forest framework, and highlight the strong performance of T5 Large and RoBERTa Large in the soft-voting ensemble of Softmax regression and Random Forest. The voting classifier, combining transformer embeddings with ensemble learning, consistently outperforms the baselines and the individual algorithms, underscoring its effectiveness for accurate and reliable news category classification.
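The two core steps of the pipeline, mean pooling over the last hidden layer and soft voting over classifier probabilities, can be sketched in a few lines. This is a minimal NumPy illustration with toy arrays, not the authors' implementation: in the study, `last_hidden` would come from a real transformer and the probability matrices from trained Softmax and Random Forest models.

```python
import numpy as np

def mean_pool(last_hidden, attention_mask):
    """Average token embeddings into one sentence embedding, ignoring padding."""
    mask = attention_mask[..., None].astype(float)   # (batch, tokens, 1)
    summed = (last_hidden * mask).sum(axis=1)        # sum over real tokens
    counts = mask.sum(axis=1)                        # number of real tokens
    return summed / counts

def soft_vote(prob_a, prob_b):
    """Average class probabilities from two classifiers; argmax picks the label."""
    avg = (prob_a + prob_b) / 2.0
    return avg.argmax(axis=1)

# Toy example: 2 articles, 4 token positions, 3-dimensional "embeddings"
last_hidden = np.arange(24, dtype=float).reshape(2, 4, 3)
attention_mask = np.array([[1, 1, 1, 0],   # article 1: 3 real tokens, 1 pad
                           [1, 1, 0, 0]])  # article 2: 2 real tokens, 2 pads
sent_emb = mean_pool(last_hidden, attention_mask)

# Hypothetical class probabilities from Softmax regression and Random Forest
p_softmax = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])
p_forest  = np.array([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]])
preds = soft_vote(p_softmax, p_forest)   # predicted category index per article
```

Masked mean pooling matters because padded positions would otherwise dilute the sentence embedding, and soft voting lets the two classifiers compensate for each other's low-confidence mistakes rather than merely overruling one another by hard majority.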

Publisher

MDPI AG

Subject

Computer Networks and Communications, Human-Computer Interaction


Cited by 1 article.

1. Application of Nepali Large Language Models to Improve Sentiment Analysis. Proceedings of the 2024 7th International Conference on Computers in Management and Business, 12 January 2024.
