TAJA Corpus: Linguistically Tagged Written Algerian Judeo-Arabic Corpus

Author:

Tirosh-Becker Ofra (1), Becker Oren M. (2)

Affiliation:

1. Professor of Hebrew and Arabic, The Hebrew University of Jerusalem, Jerusalem, Israel

2. Chairperson, Becker Consulting Ltd., Israel

Abstract

The Tagged Algerian Judeo-Arabic (TAJA) corpus is the first linguistically annotated corpus of any Judeo-Arabic dialect, regardless of geography and period. The corpus is a genre-diverse collection of written Modern Algerian Judeo-Arabic texts, encompassing translations of the Bible and of liturgical texts, commentaries, and original Judeo-Arabic books and journals. The TAJA corpus was manually annotated with part-of-speech (POS) tags and detailed morphology tags. The goal of the new corpus is twofold. First, it preserves this endangered Judeo-Arabic language, expanding on previous fieldwork and going beyond the study of individual written texts. The corpus has already enabled us to make strides towards a grammar of written Algerian Judeo-Arabic. Second, this tagged corpus serves as a foundation for the development of Judeo-Arabic-specific Natural Language Processing (NLP) tools, which allow automatic POS tagging and morphological annotation of large collections of as yet untapped texts in Algerian Judeo-Arabic and other Judeo-Arabic varieties.

Publisher

Brill

Subject

Linguistics and Language, History, Language and Linguistics, Cultural Studies

