Authors:
Zeroual Imad, Lakhouaja Abdelhak
Abstract
Data-driven approaches increasingly demand multilingual parallel resources, primarily in cross-language studies. To meet this demand, building multilingual parallel corpora is becoming the focus of many Natural Language Processing (NLP) research groups. Unlike monolingual corpora, multilingual parallel corpora remain few in number. In this paper, we introduce MulTed, a corpus of subtitles extracted from TEDx talks. It is multilingual, Part of Speech (PoS) tagged, and bilingually sentence-aligned with English as the pivot language. The corpus is designed for NLP applications in which sentence alignment, PoS tagging, and corpus size are influential, such as statistical machine translation, language recognition, and bilingual dictionary generation. Currently, the corpus contains subtitles covering 1100 talks available in over 100 languages. The subtitles are classified by topic, such as Business, Education, and Sport. For PoS tagging, TreeTagger, a language-independent PoS tagger, is used; then, to make the PoS tagging maximally useful, the resulting tags are mapped to a universal common tagset. Finally, we believe that making the MulTed corpus publicly available can be a significant contribution to the literature of NLP and corpus linguistics, especially for under-resourced languages.
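The tag-mapping step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the tagger output is already available as (word, tag) pairs; the mapping table `UNIVERSAL_TAGS`, the function `to_universal`, and the sample sentence are hypothetical and are not the table or code used for MulTed. The source tags follow TreeTagger-style English labels, and the targets are coarse universal categories.

```python
# Hypothetical, partial mapping from TreeTagger-style English tags to
# coarse universal categories. This is an illustrative assumption, not
# the mapping table used by the MulTed authors.
UNIVERSAL_TAGS = {
    "NN": "NOUN", "NNS": "NOUN", "NP": "NOUN", "NPS": "NOUN",
    "VV": "VERB", "VVD": "VERB", "VVZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "DT": "DET", "IN": "ADP",
    "PP": "PRON", "CC": "CONJ", "CD": "NUM", "SENT": ".",
}


def to_universal(tagged_tokens, mapping=UNIVERSAL_TAGS, unknown="X"):
    """Replace each language-specific tag with its universal category.

    Tags missing from the mapping fall back to the catch-all `unknown`.
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_tokens]


# Sample (word, tag) pairs standing in for real tagger output.
tagged = [("The", "DT"), ("talk", "NN"), ("inspires", "VVZ"),
          ("students", "NNS"), (".", "SENT")]
print(to_universal(tagged))
```

With one such table per language, corpora tagged with different language-specific tagsets become comparable through the shared categories, which is the purpose the abstract gives for the mapping.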
Subject
Computer Science Applications, Information Systems, Software
Cited by
7 articles.
1. Differences in Urban Development in China from the Perspective of Point of Interest Spatial Co-Occurrence Patterns;ISPRS International Journal of Geo-Information;2024-01-10
2. Korean-Centered Cross-Lingual Parallel Sentence Corpus Construction Experiment;2023 International Conference on Asian Language Processing (IALP);2023-11-18
3. Construction of Mizo: English Parallel Corpus for Machine Translation;ACM Transactions on Asian and Low-Resource Language Information Processing;2023-08-24
4. A cross-lingual video classification using subtitles;2022 2nd International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET);2022-03-03
5. Bilingual Summarization of English and Arabic Genetic Diseases Texts;International Journal for Innovation Education and Research;2021-09-01