Corpus Statistics Empowered Document Classification

Author:

Uddin FaridORCID,Chen Yibo,Zhang ZupingORCID,Huang Xin

Abstract

In natural language processing (NLP), document classification is an important task that relies on the proper thematic representation of the documents. Gaussian mixture-based clustering is widespread for capturing rich thematic semantics but ignores emphasizing potential terms in the corpus. Moreover, the soft clustering approach causes long-tail noise by putting every word into every cluster, which affects the natural thematic representation of documents and their proper classification. It is more challenging to capture semantic insights when dealing with short-length documents where word co-occurrence information is limited. In this context, for long texts, we proposed Weighted Sparse Document Vector (WSDV), which performs clustering on the weighted data that emphasizes vital terms and moderates the soft clustering by removing outliers from the converged clusters. Besides the removal of outliers, WSDV utilizes corpus statistics in different steps for the vectorial representation of the document. For short texts, we proposed Weighted Compact Document Vector (WCDV), which captures better semantic insights in building document vectors by emphasizing potential terms and capturing uncertainty information while measuring the affinity between distributions of words. Using available corpus statistics, WCDV sufficiently handles the data sparsity of short texts without depending on external knowledge sources. To evaluate the proposed models, we performed a multiclass document classification using standard performance measures (precision, recall, f1-score, and accuracy) on three long- and two short-text benchmark datasets that outperform some state-of-the-art models. The experimental results demonstrate that in the long-text classification, WSDV reached 97.83% accuracy on the AgNews dataset, 86.05% accuracy on the 20Newsgroup dataset, and 98.67% accuracy on the R8 dataset. In the short-text classification, WCDV reached 72.7% accuracy on the SearchSnippets dataset and 89.4% accuracy on the Twitter dataset.

Funder

Hunan Key Laboratory for Internet of Things in Electricity

National Natural Science Foundation of China

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering,Computer Networks and Communications,Hardware and Architecture,Signal Processing,Control and Systems Engineering

Reference73 articles.

1. Glove: Global vectors for word representation;Pennington;Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP),2014

2. Extremal GloVe: Theoretically Accurate Distributed Word Embedding by Tail Inference;Wang;Proceedings of the 7th International Conference on Communication and Information Processing (ICCIP),2021

3. From word embeddings to document distances;Kusner;Proceedings of the ICML,2015

4. Efficient vector Representation for Documents through Corruption;Minmin;Proceedings of the International Conference on Learning Representations (ICLR),2017

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3