Optimized Feature Extraction and Cross-Lingual Text Reuse Detection using Ensemble Machine Learning Models

Author:

Maqbool Muhammad Sajid1,Hanif Israr1,Iqbal Sajid1,Basit Abdul1,Shabbir Aiman2

Affiliation:

1. Bahauddin Zakariya University

2. Muhammad Nawaz Sharif University of Agriculture

Abstract

Abstract With the availability of digital data in different languages, cross-lingual plagiarism (CLP) detection has gained more importance. CLP is difficult to detect because suspicious and source texts can be written in different languages and processing of digitized text in different languages presents varying types of challenges. In this work, we propose a cross-lingual plagiarism detection method using machine learning algorithms. In this work, we have created an ensemble of machine learning algorithms and to evaluate the designed methodology, a corpus focusing Urdu-English language pair titled CLPD-UE-19 (Israr Haneef et al. 2019) is used. The corpus is a collection of 2398 documents where the source text is written in Urdu language and the suspicious text is presented in the English language. Using NLP methods, optimal features are extracted and fed to designed ensemble method for document classification. A number of aggregating techniques are employed which include majority voting, stacking, averaging, boosting, and bagging. Among these models, the stacking has performed the best achieving accuracy of 96 percent.

Publisher

Research Square Platform LLC

Reference42 articles.

1. Haneef, I., Nawab, A., Munir, R. M., E. U., & Bajwa, I. S. (2019). Design and development of a large cross-lingual plagiarism corpus for Urdu-English language pair. Scientific Programming, 2019

2. Cross-lingual plagiarism detection techniques for English-Hindi language pairs;Agarwal B;Journal of Discrete Mathematical Sciences and Cryptography,2019

3. Ikae, C., Nath, S., & Savoy, J. (2019). UniNE at PAN-CLEF 2019: Bots and Gender Task. In CLEF (Working Notes)

4. Alzahrani, S., & Aljuaid, H. (2020). Identifying cross-lingual plagiarism using rich semantic features and deep neural networks: A study on Arabic-English plagiarism cases. Journal of King Saud University-Computer and Information Sciences

5. Al-Suhaiqi, M., Hazaa22, M. A., & Albared (2018). 33, M. Arabic English Cross-Lingual Plagiarism Detection Based on Keyphrases Extraction, 2 Monolingual and Machine Learning Approach 3

Cited by 3 articles. 订阅此论文施引文献 订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献

1. An Efficient Deep Learning Approach for Prediction of Student Performance Using Neural Network;VFAST Transactions on Software Engineering;2023-12-12

2. Optimized Classification of Cardiovascular Disease Using Machine Learning Paradigms;VFAST Transactions on Software Engineering;2023-07-10

3. Sentiment Analysis of Omicron Tweets by using Machine Learning Models;VFAST Transactions on Software Engineering;2023-03-31

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3