Affiliation:
1. Université de Nantes, Cedex, France
2. Okayama University, Okayama, Japan
3. The University of Tokyo, Bunkyo-ku, Tokyo, Japan
Abstract
Current research in text mining favors the quantity of texts over their representativeness. However, for bilingual terminology mining, large comparable corpora are not available for many language pairs. More importantly, as terms are defined vis-à-vis a specific domain with a restricted register, the representativeness of the corpus is expected to matter more than its size in terminology mining. Our hypothesis, therefore, is that the representativeness of the corpus is more important than its quantity and ensures the quality of the acquired terminological resources. This article tests this hypothesis on a French-Japanese bilingual term extraction task. To demonstrate how important the type of discourse is as a characteristic of comparable corpora, we used a state-of-the-art multilingual terminology mining chain composed of two extraction programs, one for each language, and an alignment program. We evaluated the candidate translations against a reference list and found that taking discourse type into account yielded candidate translations of better quality, even when the corpus size was reduced by half.
Funder
Agence Nationale de la Recherche
Japan Society for the Promotion of Science
Publisher
Association for Computing Machinery (ACM)
Subject
Computational Mathematics, Computer Science (miscellaneous)
Cited by
5 articles.