Abstract
Named Entity Recognition is a basic task in Information Extraction that aims at identifying entities of interest within full text documents. The patterns used to recognize entities can be rule based, as in the popular JAPE system. However, hand-crafting effective patterns is often difficult, and yet there is little research devoted to methods capable of learning human-readable patterns, possibly with arbitrary sets of features. In this paper, we present a semi-automatic method to generate both regular expressions and a subset of the JAPE language. It does not need a corpus annotated beforehand. Instead, it employs active learning and combines clustering with an algorithm that finds alignments between symbols present in the entities discovered during the learning process. The method currently supports a fixed set of character features and an arbitrary set of token features, but it can incorporate other kinds of features as well. Through several experiments with an English corpus, we show the ability of the method to generate effective patterns at a low annotation cost, and how it can successfully help in the annotation of brand new corpora.
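To make the idea of learning a human-readable pattern from character features more concrete, here is a minimal sketch in Python. It is not the algorithm of the paper: it skips active learning, clustering, and symbol alignment, and simply assumes the example entities have already been grouped so that they share one sequence of character classes. The function names (`char_class`, `signature`, `generalize`) and the choice of ASCII-only classes are assumptions made for illustration.

```python
import re
from itertools import groupby


def char_class(ch):
    """Map one character to a regex fragment (ASCII-only classes, an assumption)."""
    if "0" <= ch <= "9":
        return "[0-9]"
    if "A" <= ch <= "Z":
        return "[A-Z]"
    if "a" <= ch <= "z":
        return "[a-z]"
    if ch.isspace():
        return r"\s"
    # Anything else is kept as a literal character.
    return re.escape(ch)


def signature(entity):
    """Collapse runs of the same class into (class, run_length) pairs."""
    classes = (char_class(c) for c in entity)
    return [(cls, len(list(run))) for cls, run in groupby(classes)]


def generalize(entities):
    """Build one regex covering all entities that share the same class sequence."""
    sigs = [signature(e) for e in entities]
    shapes = {tuple(cls for cls, _ in sig) for sig in sigs}
    if len(shapes) != 1:
        raise ValueError("entities do not share one shape; group them first")
    parts = []
    for position in zip(*sigs):
        cls = position[0][0]
        lengths = [n for _, n in position]
        lo, hi = min(lengths), max(lengths)
        if lo == hi == 1:
            parts.append(cls)
        elif lo == hi:
            parts.append(f"{cls}{{{lo}}}")
        else:
            parts.append(f"{cls}{{{lo},{hi}}}")
    return "".join(parts)


# Two hypothetical date-like entities, as an annotator might have confirmed them.
pattern = generalize(["12/05/2014", "3/7/2015"])
print(pattern)
print(bool(re.fullmatch(pattern, "1/12/1999")))
```

The resulting pattern stays readable because each position is a named character class with an explicit length range, which is the property that makes such rules easy for a human to inspect and correct. Entities with different shapes raise an error here; a fuller approach would group or align them before generalizing.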
Publisher
Cambridge University Press (CUP)
Subject
Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software
Cited by
4 articles.