Abstract
Arabic corpora have existed since the 1990s. Although their number is growing steadily, more advanced tools and morpho-syntactically annotated Arabic corpora are still needed for research and teaching. Likewise, parallel and specialised corpora remain rare despite the growing need for them in empirical linguistic investigations of authentic Arabic texts and in language and translation teaching. Building legal corpora will therefore pave the way for more research in Arabic legal translation, an area that is under-researched worldwide. This paper discusses the building of a collection of specialised parallel and monolingual legal corpora. In particular, it discusses the building of diachronic corpora that include all available constitutions of 22 Arabic countries. The aim of compiling all available versions of these constitutions is twofold: (1) to support interdisciplinary corpus-based and socio-cultural investigations and (2) to enable research-led and blended-learning pedagogical approaches to translation teaching and learning. These corpora are thus of great value to translation trainers and researchers, law academics and professionals, and governmental, non-governmental and international organisations. The paper demonstrates the process of building these complex specialised corpora and the challenges encountered along the way. Among the challenges faced during the data collection and processing phases are (1) the difficulty of finding the original constitutions for each Arabic country, since some of them date back to 1922; and (2) file conversion and the difficulty of choosing a single Optical Character Recognition (OCR) tool to rely on for Arabic, since many such tools lack accuracy and efficiency and suffer from encoding issues with Arabic text.
Funder
Literature, Publishing and Translation Commission, Ministry of Culture, Kingdom of Saudi Arabia
Publisher
Springer Science and Business Media LLC
Subject
Computer Science Applications, Linguistics and Language, Language and Linguistics