Automated learning of decision rules for text categorization

Author:

Apté Chidanand (1), Damerau Fred (1), Weiss Sholom M. (2)

Affiliation:

1. IBM T. J. Watson Research Center, Yorktown Heights, NY

2. Rutgers Univ., New Brunswick, NJ

Abstract

We describe the results of extensive experiments using optimized rule-based induction methods on large document collections. The goal of these methods is to automatically discover classification patterns that can be used for general document categorization or personalized filtering of free text. Previous reports indicate that human-engineered rule-based systems, requiring many man-years of development effort, have been successfully built to “read” documents and assign topics to them. We show that machine-generated decision rules appear comparable to human performance, while using the identical rule-based representation. In comparison with other machine-learning techniques, results on a key benchmark from the Reuters collection show a large gain in performance, from a previously reported 67% recall/precision breakeven point to 80.5%. In the context of a very high-dimensional feature space, several methodological alternatives are examined, including universal versus local dictionaries, and binary versus frequency-related features.
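The recall/precision breakeven point quoted in the abstract is the value at which recall and precision are equal. The following is a minimal sketch of one common way to compute it for a ranked list of documents; it is not taken from the paper, and the function name `breakeven_point` and the sample scores are hypothetical. It relies on the fact that, with k documents retrieved and R relevant documents in total, precision (tp/k) and recall (tp/R) coincide when k = R.

```python
def breakeven_point(scores, labels):
    """Return the value at which precision equals recall for a ranked list.

    scores: classifier scores, higher means more likely relevant (assumed).
    labels: 1 for a relevant document, 0 otherwise.
    """
    relevant_total = sum(labels)
    if relevant_total == 0:
        raise ValueError("need at least one relevant document")
    # Rank documents by descending score; the sort is stable on ties.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    # Retrieve exactly R documents, so precision = recall = tp / R.
    true_positives = sum(label for _, label in ranked[:relevant_total])
    return true_positives / relevant_total


# Hypothetical data: 3 relevant documents, 2 of them ranked in the top 3.
example = breakeven_point(
    scores=[0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
    labels=[1, 0, 1, 1, 0, 0],
)
```

When a system exposes only a few operating points rather than a full ranking, the breakeven value is usually estimated by interpolating between the points where precision and recall cross; this sketch does not cover that case.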

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Science Applications; General Business, Management and Accounting; Information Systems

Cited by 400 articles.

1. Regional bias in monolingual English language models;Machine Learning;2024-07-09

2. CoocNet: a novel approach to multi-label text classification with improved label co-occurrence modeling;Applied Intelligence;2024-07-02

3. An optimal feature selection method for text classification through redundancy and synergy analysis;Multimedia Tools and Applications;2024-06-28

4. Chinese Fraudulent Text Message Detection Based on Graph Neural Networks;2024 6th International Conference on Communications, Information System and Computer Engineering (CISCE);2024-05-10

5. Anchor graph-based multiview spectral clustering;Neurocomputing;2024-05
