Optimized Feature Extraction and Cross-Lingual Text Reuse Detection using Ensemble Machine Learning Models

Published: 2022-10-06

Author:

Maqbool Muhammad Sajid¹, Hanif Israr¹, Iqbal Sajid¹, Basit Abdul¹, Shabbir Aiman²

Affiliation:

1. Bahauddin Zakariya University

2. Muhammad Nawaz Sharif University of Agriculture

Abstract

With the availability of digital data in different languages, cross-lingual plagiarism (CLP) detection has gained importance. CLP is difficult to detect because the suspicious and source texts can be written in different languages, and processing digitized text in each language presents its own challenges. In this work, we propose a cross-lingual plagiarism detection method based on an ensemble of machine learning algorithms. To evaluate the methodology, we use CLPD-UE-19 (Haneef et al. 2019), a corpus for the Urdu-English language pair. The corpus is a collection of 2398 documents in which the source text is written in Urdu and the suspicious text in English. Using NLP methods, optimal features are extracted and fed to the ensemble for document classification. Several aggregation techniques are employed: majority voting, stacking, averaging, boosting, and bagging. Among these, stacking performed best, achieving an accuracy of 96 percent.
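
As a rough illustration of two of the aggregation techniques named in the abstract (majority voting and averaging), the sketch below combines the outputs of several base classifiers for a single document pair. It is not the authors' implementation: the function names, the labels "plagiarized" and "original", the 0.5 threshold, and the sample predictions are all assumptions made for this example.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the label predicted by the most base classifiers.

    Ties are broken in favor of the label that appears first in `labels`.
    """
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label


def average_vote(probabilities, threshold=0.5):
    """Average the classifiers' probabilities for the 'plagiarized' class
    and compare the mean against a threshold (0.5 is an assumed default)."""
    return "plagiarized" if mean(probabilities) >= threshold else "original"


# Hypothetical outputs of three base classifiers for one document pair.
hard_predictions = ["plagiarized", "original", "plagiarized"]
soft_predictions = [0.9, 0.4, 0.6]

print(majority_vote(hard_predictions))
print(average_vote(soft_predictions))
```

Majority voting uses only the predicted labels, while averaging uses the classifiers' confidence scores, so the two can disagree when one confident classifier outweighs several uncertain ones. Stacking, which the abstract reports as the best performer, instead trains a further model on the base classifiers' outputs and is not shown here.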

Publisher

Research Square Platform LLC

References (42 articles)

1. Haneef, I., Nawab, R. M. A., Munir, E. U., & Bajwa, I. S. (2019). Design and development of a large cross-lingual plagiarism corpus for Urdu-English language pair. Scientific Programming, 2019

2. Agarwal, B. (2019). Cross-lingual plagiarism detection techniques for English-Hindi language pairs. Journal of Discrete Mathematical Sciences and Cryptography

3. Ikae, C., Nath, S., & Savoy, J. (2019). UniNE at PAN-CLEF 2019: Bots and Gender Task. In CLEF (Working Notes)

4. Alzahrani, S., & Aljuaid, H. (2020). Identifying cross-lingual plagiarism using rich semantic features and deep neural networks: A study on Arabic-English plagiarism cases. Journal of King Saud University-Computer and Information Sciences

5. Al-Suhaiqi, M., Hazaa, M. A., & Albared, M. (2018). Arabic English Cross-Lingual Plagiarism Detection Based on Keyphrases Extraction, Monolingual and Machine Learning Approach

Cited by 3 articles.

1. An Efficient Deep Learning Approach for Prediction of Student Performance Using Neural Network;VFAST Transactions on Software Engineering;2023-12-12

2. Optimized Classification of Cardiovascular Disease Using Machine Learning Paradigms;VFAST Transactions on Software Engineering;2023-07-10

3. Sentiment Analysis of Omicron Tweets by using Machine Learning Models;VFAST Transactions on Software Engineering;2023-03-31