Automatically extracted parallel corpora enriched with highly useful metadata? A Wikipedia case study combining machine learning and social technology
Author:
Aghaebrahimian Ahmad1,
Stauder Andy1,
Ustaszewski Michael1
Affiliation:
1. Department of Translation Studies, University of Innsbruck, Austria
Abstract
The extraction of large amounts of multilingual parallel text from web resources is a widely used technique in natural language processing. However, automatically collected parallel corpora usually lack precise metadata, which are crucial to accurate data analysis and interpretation. The combination of automated extraction procedures and manual metadata enrichment may help address this issue. Wikipedia is a promising candidate for the exploration of the potential of said combination of methods because it is a rich source of translations in a large number of language pairs and because its open and collaborative nature makes it possible to identify and contact the users who produce translations. This article tests to what extent translated texts automatically extracted from Wikipedia by means of neural networks can be enriched with pertinent metadata through a self-submission-based user survey. Special emphasis is placed on data usefulness, defined in terms of a catalogue of previously established assessment criteria, most prominently metadata quality. The results suggest that from a quantitative perspective, the proposed methodology is capable of capturing metadata otherwise not available. At the same time, the crowd-based collection of data and metadata may face important technical and social limitations.
Funder
TransBank: A Meta-Corpus for Translation Research
Austrian Academy of Sciences
Publisher
Oxford University Press (OUP)
Subject
Computer Science Applications, Linguistics and Language, Language and Linguistics, Information Systems
Cited by
2 articles.
1. The Design of English Translation Software Based on Machine Learning Technology;2022 5th Asia Conference on Machine Learning and Computing (ACMLC);2022-12
2. Wikipedia and translation;The Routledge Handbook of Translation and Media;2021-11-29