WORDS VERSUS CHARACTER N-GRAMS FOR ANTI-SPAM FILTERING

Published:2007-12 Issue:06 Volume:16 Page:1047-1067
ISSN:0218-2130
Container-title:International Journal on Artificial Intelligence Tools
language:en
Short-container-title:Int. J. Artif. Intell. Tools

Author:

KANARIS IOANNIS¹,KANARIS KONSTANTINOS¹,HOUVARDAS IOANNIS¹,STAMATATOS EFSTATHIOS¹

Affiliation:

1. Department of Information and Communication Systems Eng., University of the Aegean, Karlovassi, Samos – 83200, Greece

Abstract

The increasing number of unsolicited e-mail messages (spam) reveals the need for the development of reliable anti-spam filters. The vast majority of content-based techniques rely on word-based representation of messages. Such approaches require reliable tokenizers for detecting the token boundaries. As a consequence, a common practice of spammers is to attempt to confuse tokenizers using unexpected punctuation marks or special characters within the message. In this paper we explore an alternative low-level representation based on character n-grams which avoids the use of tokenizers and other language-dependent tools. Based on experiments on two well-known benchmark corpora and a variety of evaluation measures, we show that character n-grams are more reliable features than word-tokens despite the fact that they increase the dimensionality of the problem. Moreover, we propose a method for extracting variable-length n-grams which produces optimal classifiers among the examined models under cost-sensitive evaluation.
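The contrast the abstract draws between word tokens and character n-grams can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`word_tokens`, `char_ngrams`, `jaccard`), the simple regex tokenizer, the fixed n-gram length of 3, and the example strings. It does not implement the variable-length n-gram method the authors propose; it only shows why obfuscating punctuation tends to hurt word features more than character features.

```python
import re


def word_tokens(text):
    """Split lowercased text into alphanumeric word tokens.

    A deliberately simple tokenizer (an assumption for this sketch):
    any character outside [a-z0-9] is treated as a token boundary.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the lowercased raw text.

    No tokenizer is involved: spaces and punctuation are kept as
    ordinary characters inside the n-grams.
    """
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def jaccard(a, b):
    """Jaccard overlap between two feature collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical example strings: the second one inserts punctuation
# inside words, the kind of trick meant to confuse tokenizers.
clean = "cheap viagra online"
obfuscated = "ch-eap v.iagra onl_ine"

word_overlap = jaccard(word_tokens(clean), word_tokens(obfuscated))
char_overlap = jaccard(char_ngrams(clean), char_ngrams(obfuscated))

print("word-token overlap:   ", word_overlap)
print("char 3-gram overlap:  ", char_overlap)
```

In this toy case the obfuscated message shares no word tokens with the clean one, while a good share of its character 3-grams survive. The price is a larger feature space, which is the dimensionality increase the abstract mentions.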

Publisher

World Scientific Pub Co Pte Ltd

Subject

Artificial Intelligence

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0218213007003692

References: 11 articles.

1. Machine learning in automated text categorization

2. Support vector machines for spam categorization

3. The Nature of Statistical Learning Theory

Cited by 56 articles.

1. Gender Dynamics in Drama Translation: A Stylometric Analysis Through Principal Component Analysis;2024

2. Evaluation of Different Plagiarism Detection Methods: A Fuzzy MCDM Perspective;Applied Sciences;2022-04-30

3. The informational value of multi-attribute online consumer reviews: A text mining approach;Journal of Retailing and Consumer Services;2022-03

4. Language detection using multinomial naïve bayes algorithm;i-manager's Journal on Computer Science;2022

5. A language independent approach to multilingual document representation including Arabic;2021 IEEE/ACS 18th International Conference on Computer Systems and Applications (AICCSA);2021-11