Understanding the Research Challenges in Low-Resource Language and Linking Bilingual News Articles in Multilingual News Archive

Author:

Khan Muzammil1, Ullah Kifayat1, Alharbi Yasser2, Alferaidi Ali2, Alharbi Talal Saad2, Yadav Kusum2, Alsharabi Naif2, Ahmad Aakash2

Affiliation:

1. Department of Computer and Software Technology, University of Swat, Mingora 19130, Pakistan

2. College of Computer Science & Engineering, University of Ha’il, Ha’il 81451, Saudi Arabia

Abstract

The developed world has focused more on Web preservation than the developing world, especially on preserving news for future generations. However, news published online is volatile because of constant changes in the technologies used to disseminate information and in the formats used for publication. News preservation became more complicated and challenging when the archive began to include articles in low-resource, morphologically complex languages such as Urdu and Arabic alongside English news articles. The digital news story preservation (DNSP) framework is enriched with eighteen Urdu, Arabic, and English news sources. This study presents the challenges posed by low-resource languages (LRLs), the associated research challenges, and details of how the framework is enhanced. In this paper, we introduce a multilingual news archive and discuss the digital news story extractor, which addresses major issues in handling low-resource languages and facilitates migration to a normalized format. The extraction results are presented in detail for a high-resource language, i.e., English, and for low-resource languages, i.e., Urdu and Arabic. LRLs encountered a higher error rate during preservation than high-resource languages (HRLs): 10% and 3%, respectively. The extraction results also show that a few news sources are not regularly updated and release few new stories online. LRLs require more detailed study for accurate news content extraction and archiving for future access. Together, LRL and HRL sources enrich the DNSP framework. The Digital News Stories Archive (DNSA) preserves a large number of news articles from multiple news sources in LRLs and HRLs. This paper presents the research challenges encountered while preserving Urdu- and Arabic-language news articles to create a multilingual news archive.
The second part of the paper compares two bilingual linking mechanisms for Urdu-to-English news articles in the DNSA, the common ratio measure for dual language (CRMDL) and the similarity measure based on transliteration words (SMTW), against the cosine similarity measure (CSM) as a baseline. The experimental results show that SMTW is more effective than CRMDL and CSM for linking Urdu-to-English news articles. Precision was 46%, 50%, and 60%, and recall was 64%, 67%, and 82%, for CSM, CRMDL, and SMTW, respectively; the impact of common terms also improved.
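As a rough illustration of the baseline and the metrics named above, the sketch below computes a standard cosine similarity over term-frequency vectors and the precision and recall of a set of predicted article links. It is not the authors' implementation: the function names, the whitespace-style token lists, and the assumption that both articles have already been mapped into a shared vocabulary (for example, by translating or transliterating the Urdu terms) are hypothetical choices made for this example only. CRMDL and SMTW are not defined in the abstract and are therefore not sketched.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term frequencies.

    Assumption: both lists are already in a shared vocabulary.
    """
    tf_a = Counter(tokens_a)
    tf_b = Counter(tokens_b)
    # Dot product over the terms the two articles have in common.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty article is similar to nothing
    return dot / (norm_a * norm_b)


def precision_recall(predicted_links, true_links):
    """Precision and recall of predicted article links against a gold set.

    Both arguments are sets of (source_id, target_id) pairs (hypothetical IDs).
    """
    correct = len(predicted_links & true_links)
    precision = correct / len(predicted_links) if predicted_links else 0.0
    recall = correct / len(true_links) if true_links else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy, made-up token lists; not data from the archive.
    article_a = ["flood", "relief", "swat", "flood"]
    article_b = ["flood", "swat", "rescue"]
    print(cosine_similarity(article_a, article_b))
    print(precision_recall({(1, 10), (2, 20)}, {(1, 10), (3, 30)}))
```

In a linking setting of this kind, a pair of articles would typically be linked when the similarity score exceeds a chosen threshold; the threshold itself is a tuning parameter that the abstract does not state.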

Funder

Scientific Research Deanship at the University of Ha’il—Saudi Arabia

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

