Can cross-domain term extraction benefit from cross-lingual transfer and nested term labeling?
Published: 2024-03-27
Issue: 7
Volume: 113
Page: 4285-4314
ISSN: 0885-6125
Container-title: Machine Learning
Language: en
Short-container-title: Mach Learn
Author:
Tran Hanh Thi Hong, Martinc Matej, Repar Andraz, Ljubešić Nikola, Doucet Antoine, Pollak Senja
Abstract
Automatic term extraction (ATE) is a natural language processing task that eases the effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. In this paper, we treat ATE as a sequence-labeling task and explore the efficacy of XLMR in evaluating cross-lingual and multilingual learning against monolingual learning in the cross-domain ATE context. Additionally, we introduce NOBI, a novel annotation mechanism enabling the labeling of single-word nested terms. Our experiments are conducted on the ACTER corpus, encompassing four domains and three languages (English, French, and Dutch), as well as the RSDO5 Slovenian corpus, encompassing four additional domains. Results indicate that cross-lingual and multilingual models outperform monolingual settings, showcasing improved F1-scores for all languages within the ACTER dataset. When incorporating an additional Slovenian corpus into the training set, the multilingual model exhibits superior performance compared to state-of-the-art approaches in specific scenarios. Moreover, the newly introduced NOBI labeling mechanism significantly enhances the classifier’s capacity to extract short nested terms, leading to substantial improvements in Recall for the ACTER dataset and consequently boosting the overall F1-score performance.
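To make the sequence-labeling framing concrete, the following is a minimal sketch, not the authors' code, of ordinary flat B/I/O labeling against a gold term list. The helper names `bio_labels` and `decode_terms`, the greedy longest-match strategy, and the toy sentence are all assumptions made for illustration. The sketch shows why a flat scheme cannot mark a single-word term nested inside a longer one, which is the gap the abstract says NOBI addresses; the NOBI label set itself is not reproduced here.

```python
from typing import List, Sequence, Set, Tuple

Term = Tuple[str, ...]


def bio_labels(tokens: Sequence[str], terms: Set[Term]) -> List[str]:
    """Label tokens with B/I/O using greedy longest match (illustrative only)."""
    lowered = [t.lower() for t in tokens]
    max_len = max((len(t) for t in terms), default=0)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in terms:
                matched = n
                break
        if matched:
            labels[i] = "B"
            for j in range(i + 1, i + matched):
                labels[j] = "I"
            i += matched
        else:
            i += 1
    return labels


def decode_terms(tokens: Sequence[str], labels: Sequence[str]) -> Set[Term]:
    """Recover term spans from a flat B/I/O label sequence."""
    found: Set[Term] = set()
    current: List[str] = []
    for tok, lab in zip(tokens, labels):
        if lab == "B":
            if current:
                found.add(tuple(current))
            current = [tok.lower()]
        elif lab == "I" and current:
            current.append(tok.lower())
        else:
            if current:
                found.add(tuple(current))
            current = []
    if current:
        found.add(tuple(current))
    return found


# Hypothetical toy data: "energy" is a term on its own and also nested
# inside the longer term "wind energy".
tokens = ["Wind", "energy", "reduces", "emissions"]
gold = {("wind", "energy"), ("energy",), ("emissions",)}

labels = bio_labels(tokens, gold)
recovered = decode_terms(tokens, labels)
missed = gold - recovered  # terms the flat scheme cannot represent here
```

In this toy case the nested single-word term ends up in `missed`, so any classifier trained on such flat labels loses recall on short nested terms regardless of how well it fits the labels.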
Funder
Javna Agencija za Raziskovalno Dejavnost RS Republic of Slovenia and the European Union Région Nouvelle Aquitaine Campus France
Publisher
Springer Science and Business Media LLC
Cited by
2 articles.