Part-of-speech tagging of Modern Hebrew text

Published:2008-04 Issue:2 Volume:14 Page:223-251
ISSN:1351-3249
Container-title:Natural Language Engineering
language:en
Short-container-title:Nat. Lang. Eng.

Author:

Roy Bar-Haim, Khalil Sima'an, Yoad Winter

Abstract

Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a part-of-speech (POS) category. Semitic words may be ambiguous with regard to their segmentation as well as to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a major architectural decision concerns the choice of the atomic input tokens (terminal symbols). If the tokenization is at the word level, the output tags must be complex, and represent both the segmentation of the word and the POS tag assigned to each word segment. If the tokenization is at the segment level, the input itself must encode the different alternative segmentations of the words, while the output consists of standard POS tags. Comparing these two alternatives is not trivial, as the choice between them may have global effects on the grammatical model. Moreover, intermediate levels of tokenization between these two extremes are conceivable, and, as we aim to show, beneficial. To the best of our knowledge, the problem of tokenization for POS tagging of Semitic languages has not been addressed before in full generality. In this paper, we study this problem for the purpose of POS tagging of Modern Hebrew texts. After extensive error analysis of the two simple tokenization models, we propose a novel, linguistically motivated, intermediate tokenization model that gives better performance for Hebrew over the two initial architectures. Our study is based on the well-known hidden Markov models (HMMs). We start out from a manually devised morphological analyzer and a very small annotated corpus, and describe how to adapt an HMM-based POS tagger for both tokenization architectures. We present an effective technique for smoothing the lexical probabilities using an untagged corpus, and a novel transformation for casting the segment-level tagger in terms of a standard, word-level HMM implementation. The results obtained using our model are on par with the best published results on Modern Standard Arabic, despite the much smaller annotated corpus available for Modern Hebrew.
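
The contrast between the two tokenization levels described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the analyzer table `ANALYSES`, the tag names, the probability tables `TRANS` and `EMIT`, and the helper names are all hypothetical, and the word forms are placeholder strings rather than real Hebrew. The sketch shows a word-level view (one complex tag per word that encodes both segmentation and POS), a segment-level view (a list of alternative segmentations per word), and a plain bigram Viterbi decoder over the word-level complex tags with a crude probability floor standing in for smoothing.

```python
import math

# Hypothetical analyzer output (assumption): each word maps to its possible
# analyses; an analysis is a tuple of (segment, POS) pairs.
# The strings are placeholders, not real Hebrew words.
ANALYSES = {
    "ab": [(("a", "PREP"), ("b", "NOUN")), (("ab", "VERB"),)],
    "cd": [(("cd", "ADJ"),)],
}

def complex_tag(analysis):
    """Word-level view: one tag encoding segmentation and POS, e.g. 'PREP+NOUN'."""
    return "+".join(pos for _, pos in analysis)

def segment_lattice(words):
    """Segment-level view: for each word, its alternative segment sequences."""
    return [[[seg for seg, _ in a] for a in ANALYSES[w]] for w in words]

def viterbi(words, trans, emit, floor=1e-6):
    """Bigram HMM decoding over word-level complex tags.

    trans[(prev_tag, tag)] and emit[(tag, word)] are probabilities;
    unseen events fall back to `floor` (a crude stand-in for smoothing).
    """
    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {"<s>": (0.0, [])}
    for w in words:
        nxt = {}
        for analysis in ANALYSES[w]:
            tag = complex_tag(analysis)
            e = math.log(emit.get((tag, w), floor))
            nxt[tag] = max(
                (s + math.log(trans.get((prev, tag), floor)) + e, path + [tag])
                for prev, (s, path) in best.items()
            )
        best = nxt
    return max(best.values())[1]

# Invented toy parameters (assumption), only to make the sketch runnable.
TRANS = {
    ("<s>", "PREP+NOUN"): 0.6,
    ("<s>", "VERB"): 0.4,
    ("PREP+NOUN", "ADJ"): 0.5,
    ("VERB", "ADJ"): 0.1,
}
EMIT = {
    ("PREP+NOUN", "ab"): 0.5,
    ("VERB", "ab"): 0.5,
    ("ADJ", "cd"): 1.0,
}

if __name__ == "__main__":
    print(segment_lattice(["ab", "cd"]))
    print(viterbi(["ab", "cd"], TRANS, EMIT))
```

In the word-level view the tag set grows with every distinct combination of segment tags, whereas the segment-level view keeps standard POS tags but requires the decoder to search over the lattice returned by `segment_lattice`; that trade-off is the architectural choice the abstract refers to.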

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence,Linguistics and Language,Language and Linguistics,Software

Cited by 13 articles.

1. Semantic and syntactic constraints in resolving homography: a developmental study in Hebrew;Reading and Writing;2021-02-24

2. Tagging Part of Speech in Hausa Sentences;2019 15th International Conference on Electronics, Computer and Computation (ICECCO);2019-12

3. Joint Transition-Based Models for Morpho-Syntactic Parsing: Parsing Strategies for MRLs and a Case Study from Modern Hebrew;Transactions of the Association for Computational Linguistics;2019-03-01

4. An Algorithmic Scheme for Statistical Thesaurus Construction in a Morphologically Rich Language;Applied Artificial Intelligence;2019-02-27

5. Identifying translationese at the word and sub-word level;Digital Scholarship in the Humanities;2014-09-12