Affiliation:
1. Department of Applied Mathematics and Computer Science, Lahijan Branch, Islamic Azad University, Lahijan, Iran
Abstract
Named-entity recognition (NER) is challenging for languages with few digital resources. The main difficulty is the scarcity of annotated corpora, which makes training an effective NER model problematic. We propose a model customized to the linguistic properties of the target language to compensate for this lack of resources in low-resource languages such as Persian. Motivated by Persian's pronoun-dropping and subject-object-verb (SOV) word order, we propose a new weighted relative positional encoding in the self-attention mechanism. Using a pointwise mutual information (PMI) factor, we inject co-occurrence information into the context representation. We trained and tested our model on three datasets: Arman, Peyma, and ParsTwiNER, achieving word-level F1 scores of 94.16%, 93.36%, and 84.49%, respectively. The experiments show that our model outperforms other Persian NER models. An ablation study and a case study further show that our method converges faster and is less prone to overfitting.
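For readers unfamiliar with the mechanism the abstract refers to, below is a minimal sketch of single-head self-attention with relative positional encodings in the style of Shaw et al. (2018), extended with a hypothetical learnable per-offset scalar weight standing in for the "weighted" variant. The class name, the `max_rel_dist` parameter, and the weighting scheme are illustrative assumptions; the abstract does not specify the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedRelativeSelfAttention(nn.Module):
    """Single-head self-attention with relative positional encodings
    (Shaw et al., 2018), plus a hypothetical learnable scalar weight
    per relative offset as a stand-in for the paper's weighted variant."""

    def __init__(self, d_model: int, max_rel_dist: int = 16):
        super().__init__()
        self.d = d_model
        self.max_rel = max_rel_dist
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # One embedding per clipped relative offset in [-max_rel, max_rel].
        self.rel_emb = nn.Embedding(2 * max_rel_dist + 1, d_model)
        # Hypothetical per-offset weight (assumption, not from the paper).
        self.rel_weight = nn.Parameter(torch.ones(2 * max_rel_dist + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, T, _ = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)

        # Clipped relative offsets j - i, shifted into [0, 2 * max_rel].
        pos = torch.arange(T, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel, self.max_rel)
        idx = rel + self.max_rel                       # (T, T)
        a = self.rel_emb(idx)                          # (T, T, d)
        w = self.rel_weight[idx]                       # (T, T)

        # Content-content scores plus weighted content-position scores.
        scores = torch.einsum("btd,bsd->bts", q, k)
        scores = scores + w * torch.einsum("btd,tsd->bts", q, a)
        attn = F.softmax(scores / self.d ** 0.5, dim=-1)
        return torch.einsum("bts,bsd->btd", attn, v)
```

The co-occurrence injection mentioned in the abstract is presumably based on the standard definition PMI(w_i, w_j) = log( p(w_i, w_j) / (p(w_i) p(w_j)) ), estimated from corpus counts; how that factor is folded into the context representation is not detailed in the abstract.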
Publisher
Association for Computing Machinery (ACM)