Text Classification Using Compression-Based Dissimilarity Measures

Published:2015-07-09 Issue:05 Volume:29 Page:1553004
ISSN:0218-0014
Container-title:International Journal of Pattern Recognition and Artificial Intelligence
language:en
Short-container-title:Int. J. Patt. Recogn. Artif. Intell.

Author:

Coutinho David Pereira¹, Figueiredo Mário A. T.²

Affiliation:

1. Instituto de Telecomunicações and Instituto Superior de Engenharia de Lisboa (ISEL), Instituto Politécnico de Lisboa, 1959-007 Lisboa, Portugal

2. Instituto de Telecomunicações and Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisboa, Portugal

Abstract

Arguably, the most difficult task in text classification is choosing an appropriate set of features that allows machine learning algorithms to classify accurately. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g. nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, reveals that they approach, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler in the sense that they do not require any text pre-processing or feature engineering.
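
The idea of a dissimilarity-based representation can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the normalized compression distance (NCD) as the dissimilarity measure and zlib as the compressor, and the function names (`ncd`, `dissimilarity_vector`, `nearest_neighbor_label`) are hypothetical. Each text is mapped to a vector of its dissimilarities to a set of prototype texts, and a nearest-neighbor rule is applied directly to the raw texts, with no tokenization or feature design.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Smaller means more similar.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def dissimilarity_vector(text: str, prototypes: list[str]) -> list[float]:
    """Represent a text by its dissimilarities to each prototype text."""
    return [ncd(text, p) for p in prototypes]


def nearest_neighbor_label(text: str, labeled_texts: list[tuple[str, str]]) -> str:
    """Return the label of the training text with the smallest NCD to `text`."""
    closest = min(labeled_texts, key=lambda pair: ncd(text, pair[0]))
    return closest[1]
```

The vectors returned by `dissimilarity_vector` could be passed to any standard classifier, such as a support vector machine, in place of hand-crafted features. Note that zlib has a limited compression window, so this sketch is only reasonable for short texts; other compressors may suit longer documents better.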

Publisher

World Scientific Pub Co Pte Ltd

Subject

Artificial Intelligence, Computer Vision and Pattern Recognition, Software

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0218001415530043

Cited by 9 articles.

1. Classifying Source Code: How Far Can Compressor-based Classifiers Go?;Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings;2024-04-14

2. Visual Analysis of Research Paper Collections Using Normalized Relative Compression;Entropy;2019-06-21

3. Construction of Efficient V-Gram Dictionary for Sequential Data Analysis;Proceedings of the 27th ACM International Conference on Information and Knowledge Management;2018-10-17

4. Extended-alphabet finite-context models;Pattern Recognition Letters;2018-09

5. Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes;Entropy;2018-05-23