Author:
DEL GAUDIO ROSA, BATISTA GUSTAVO, BRANCO ANTÓNIO
Abstract
This paper addresses the task of automatic extraction of definitions by thoroughly exploring an approach that relies solely on machine learning techniques, and by focusing on the issue of the imbalance of the relevant datasets. We obtained a breakthrough in the automatic extraction of definitions by extensively and systematically experimenting with different sampling techniques and their combinations, as well as with a range of different types of classifiers. Performance consistently scored in the range of 0.95–0.99 of area under the receiver operating characteristic (ROC) curve, a notable improvement of between 17 and 22 percentage points over the baseline of 0.73–0.77, for datasets with different rates of imbalance. Thus, the present paper also contributes to the seminal work in natural language processing that points toward the importance of applying sampling techniques to mitigate the bias induced by highly imbalanced datasets, and thereby of greatly improving the performance of the large range of tools that rely on them.
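As a minimal illustration of the two ideas the abstract combines, the sketch below shows one simple sampling technique (random oversampling of the minority class) and a direct computation of the area under the ROC curve. It is not the authors' implementation: the abstract does not say which sampling techniques or classifiers were used, and the function names `random_oversample` and `roc_auc`, the 0/1 label encoding, and the fixed seed are all assumptions made for this example.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are equal in size.

    Assumes binary labels encoded as 0/1 and at least one example of each class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Sample minority indices with replacement to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def roc_auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a random
    positive example is scored above a random negative one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a setup like this, resampling would normally be applied to the training split only, so that the area under the ROC curve is measured on data with the original class distribution.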
Publisher
Cambridge University Press (CUP)
Subject
Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software
Cited by
11 articles.