A paper-text perspective

Published: 2017-08-07
Issue: 4
Volume: 35
Pages: 689-708
ISSN: 0264-0473
Container-title: The Electronic Library
Language: en
Short-container-title: EL

Author:

Wang Hao, Deng Sanhong

Abstract

Purpose: In the era of Big Data, networked digital resources are growing rapidly, and short-text resources such as tweets, comments and messages are especially vigorous. This study aims to compare the categories discriminative capacity (CDC) of Chinese language fragments of different granularities and to explore and verify the feasibility, rationality and effectiveness of low-granularity features, such as Chinese characters, in Chinese short-text classification (CSTC).

Design/methodology/approach: This study uses the discipline classification of journal articles from CSSCI as a simulation environment. After sorting out the distribution rules of classification features of various granularities (keywords, terms and characters), the classification effects achieved by the SVM algorithm are comprehensively compared and evaluated from three angles: using the same experimental samples, testing before and after feature optimization, and introducing external data.

Findings: The granularity of a classification feature has an important impact on CSTC. In general, the larger the granularity, the better the classification result, and vice versa. However, a low-granularity feature is also feasible: its CDC could be improved by reasonable weight setting, and it could even outperform a high-granularity feature when classification precision, computational complexity and text coverage are considered together.

Originality/value: This is the first study to propose that Chinese characters are more suitable as descriptive features in CSTC than terms and keywords, and to demonstrate that the CDC of Chinese character features could be strengthened by using a mix of frequency and position as the weight.
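
The abstract does not give the weighting formula, so the following is only an illustrative sketch of what a character-granularity feature weighted by both frequency and position might look like. The function name `char_features`, the `title_boost` parameter and its value of 2.0 are assumptions made for this example, and "position" is interpreted here simply as whether a character occurs in the title or in the remaining text; none of this is taken from the paper.

```python
from collections import defaultdict


def is_cjk(ch: str) -> bool:
    """Return True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def char_features(title: str, body: str = "", title_boost: float = 2.0) -> dict:
    """Build a normalized character-level feature vector for one short text.

    Each occurrence of a character adds 1.0 to its weight (frequency), and
    occurrences in the title add `title_boost` instead (position).
    `title_boost` is a hypothetical value chosen for illustration only.
    """
    weights = defaultdict(float)
    for ch in title:
        if is_cjk(ch):
            weights[ch] += title_boost
    for ch in body:
        if is_cjk(ch):
            weights[ch] += 1.0
    total = sum(weights.values())
    if total == 0:
        return {}
    # Normalize so that the weights of one text sum to 1.
    return {ch: w / total for ch, w in weights.items()}


if __name__ == "__main__":
    features = char_features("文本分类", "文本")
    for ch, weight in sorted(features.items(), key=lambda kv: -kv[1]):
        print(ch, round(weight, 3))
```

Vectors of this kind could then be passed to a classifier such as an SVM; the sketch stops at feature construction because the abstract specifies neither the kernel nor the training setup.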

Publisher

Emerald

Subject

Library and Information Sciences, Computer Science Applications

Cited by 7 articles.

1. A method of measuring the article discriminative capacity and its distribution;Scientometrics;2022-04-18

2. The Research Trends of Text Classification Studies (2000–2020): A Bibliometric Analysis;SAGE Open;2022-04

3. Text Language Classification Based on Dynamic Word Vector and Attention Mechanism;2021 International Conference on Big Data Analytics for Cyber-Physical System in Smart City;2022

4. Prediction of Obstetric Patient Flow and Horizontal Allocation of Medical Resources Based on Time Series Analysis;Frontiers in Public Health;2021-10-14

5. Web News Data Extraction Technology Based on Text Keywords;Complexity;2021-04-16