Affiliations:
1. Wuhan Vocational College of Software and Engineering, Wuhan, Hubei, China
2. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
Abstract
News text analysis is an important branch of natural language processing. Compared with ordinary text, news text has significant economic and scientific value. It is characterized by a hierarchical structure, diverse label categories, and a limited supply of high-quality annotated samples. Many machine learning and deep learning methods have been developed to analyze various forms of news text. However, existing methods remain limited by label imbalance, hierarchical semantics, and easily confused labels. Therefore, this paper proposes a news text classification framework based on hierarchical semantics and prior correction (HSPC). First, data augmentation is used to increase the diversity of the training set, and adversarial learning is employed to improve the robustness of the model. Then, a hierarchical feature extraction approach is employed to extract semantic features from different levels of the news text. Subsequently, a feature fusion method is designed to allow the model to focus on the hierarchical semantics relevant to label classification. Finally, predictions for highly confusable labels are corrected to optimize the model's label predictions and improve their confidence. Multiple experiments are performed on four widely used public datasets. The experimental results indicate that HSPC achieves higher classification accuracy than the compared models. On the FCT, AGNews, THUCNews, and Ohsumed datasets, HSPC improves accuracy by 1.03%, 1.38%, 2.55%, and 1.15%, respectively, compared with state-of-the-art methods. These results validate the rationality and effectiveness of the designed mechanisms.