An Information-Extraction System for Urdu---A Resource-Poor Language

Author:

Mukund Smruthi (1), Srihari Rohini (1), Peterson Erik (2)

Affiliation:

1. State University of New York at Buffalo

2. Janya, Inc.

Abstract

There has been an increase in the amount of multilingual text on the Internet due to the proliferation of news sources and blogs. The Urdu language, in particular, has experienced explosive growth on the Web. Text mining for information discovery, which includes tasks such as identifying topics, relationships, and events, as well as sentiment analysis, requires sophisticated natural language processing (NLP). NLP systems begin with modules such as word segmentation, part-of-speech tagging, and morphological analysis and progress to modules such as shallow parsing and named entity tagging. While there have been considerable advances in developing such comprehensive NLP systems for English, the work for Urdu is still in its infancy. The tasks of interest in Urdu NLP include analyzing data sources such as blogs and comments on news articles to provide insight into social and human behavior. All of this requires a robust NLP system. The objective of this work is to develop an NLP infrastructure for Urdu that is customizable and capable of providing basic analysis on which more advanced information extraction tools can be built. This system assimilates resources from various online sources to facilitate improved named entity tagging and Urdu-to-English transliteration. The annotated data required to train the learning models used here is acquired by standardizing the currently limited resources available for Urdu. Techniques such as bootstrap learning and resource sharing from a syntactically similar language, Hindi, are explored to augment the available annotated Urdu data. Each of the new Urdu text processing modules has been integrated into a general text-mining platform. The evaluations performed demonstrate that the accuracies have either met or exceeded the state of the art.

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Cited by 31 articles.

1. Advancing Urdu NLP: Aspect-Based Sentiment Analysis with Graph Attention Networks;2024 International Joint Conference on Neural Networks (IJCNN);2024-06-30

2. Quantum Transfer Learning for Sentiment Analysis: an experiment on an Italian corpus;Proceedings of the 2024 Workshop on Quantum Search and Information Retrieval;2024-06-03

3. Enriching Urdu NER with BERT Embedding, Data Augmentation, and Hybrid Encoder-CNN Architecture;ACM Transactions on Asian and Low-Resource Language Information Processing;2024-04-15

4. A deep learning approach for Named Entity Recognition in Urdu language;PLOS ONE;2024-03-28

5. Roman Urdu Slang Dictionary Development for Facebook Comment Sentiment Analysis;2024 IEEE 1st Karachi Section Humanitarian Technology Conference (KHI-HTC);2024-01-08
