Affiliation:
1. Professor of Hebrew and Arabic, The Hebrew University of Jerusalem, Jerusalem, Israel
2. Chairperson, Becker Consulting Ltd., Israel
Abstract
The Tagged Algerian Judeo-Arabic (TAJA) corpus is the first linguistically annotated corpus of any Judeo-Arabic dialect, regardless of geography and period. The corpus is a genre-diverse collection of written Modern Algerian Judeo-Arabic texts, encompassing translations of the Bible and of liturgical texts, commentaries, and original Judeo-Arabic books and journals. The TAJA corpus was manually annotated with part-of-speech (POS) tags and detailed morphology tags. The goal of the new corpus is twofold. First, it preserves this endangered Judeo-Arabic language, expanding on previous fieldwork and going beyond the study of individual written texts. The corpus has already enabled us to make strides towards a grammar of written Algerian Judeo-Arabic. Second, this tagged corpus serves as a foundation for the development of Judeo-Arabic-specific Natural Language Processing (NLP) tools, which allow automatic POS tagging and morphological annotation of large collections of as yet untapped texts in Algerian Judeo-Arabic and other Judeo-Arabic varieties.
Subject
Linguistics and Language, History, Language and Linguistics, Cultural Studies