Design and Development of a Large Cross-Lingual Plagiarism Corpus for Urdu-English Language Pair

Published: 2019-03-17 Volume: 2019 Pages: 1-11
ISSN: 1058-9244
Journal: Scientific Programming
Language: English

Author:

Haneef Israr¹, Adeel Nawab Rao Muhammad², Munir Ehsan Ullah¹, Bajwa Imran Sarwar³

Affiliation:

1. Department of Computer Science, COMSATS Institute of Information Technology, Wah Campus, Wah Cantonment, Pakistan

2. Department of Computer Science, COMSATS Institute of Information Technology, Lahore Campus, Lahore, Pakistan

3. Department of Computer Science, The Islamia University of Bahawalpur, Bahawalpur, Pakistan

Abstract

Cross-lingual plagiarism occurs when the source (or original) text is in one language and the plagiarized text is in another. In recent years, cross-lingual plagiarism detection has attracted the attention of the research community for two reasons: a large amount of digital text is easily accessible in many languages through online digital repositories, and machine translation systems are readily available. Together, these make cross-lingual plagiarism easier to commit and harder to detect. To develop and evaluate cross-lingual plagiarism detection systems, standard evaluation resources are needed. Most earlier studies have developed cross-lingual plagiarism corpora for English paired with other European languages. For the Urdu-English language pair, however, the problem of cross-lingual plagiarism detection has not been thoroughly explored, even though a large amount of digital text is readily available in Urdu and the language is spoken in many countries (particularly Pakistan, India, and Bangladesh). To fill this gap, this paper presents a large benchmark cross-lingual corpus for the Urdu-English language pair. The proposed corpus contains 2,395 source-suspicious document pairs (540 automatic translation, 539 artificially paraphrased, 508 manually paraphrased, and 808 nonplagiarized). Furthermore, the corpus contains three types of cross-lingual examples, namely artificial (automatic translation and artificially paraphrased), simulated (manually paraphrased), and real (nonplagiarized), which have not previously been reported in the development of cross-lingual corpora. A detailed analysis of the corpus was carried out using n-gram overlap and longest common subsequence approaches. Using word unigrams, mean similarity scores of 1.00, 0.68, 0.52, and 0.22 were obtained for automatic translation, artificially paraphrased, manually paraphrased, and nonplagiarized documents, respectively. These results show that documents in the proposed corpus were created using different obfuscation techniques, which makes the dataset more realistic and challenging. We believe that the corpus developed in this study will help to foster research on Urdu, an under-resourced language, and will be useful in the development, comparison, and evaluation of cross-lingual plagiarism detection systems for the Urdu-English language pair. The proposed corpus is free and publicly available for research purposes.
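
The abstract names two similarity measures, n-gram overlap and longest common subsequence, without giving their formulas. The following is a minimal sketch of how such scores are commonly computed, not the authors' implementation. It assumes that both documents are already in the same language (for example, after translating the Urdu source into English), that tokens are whitespace-separated and lowercased, that n-gram overlap is measured as containment (shared n-grams divided by the n-grams of the suspicious document), and that the LCS length is normalized by the shorter document. All function names are hypothetical.

```python
def word_ngrams(text, n=1):
    """Return the set of word n-grams in a text (assumed tokenization:
    lowercase, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspicious, source, n=1):
    """Share of the suspicious document's n-grams that also occur in the
    source. Assumed normalization; the paper may use a different one."""
    susp = word_ngrams(suspicious, n)
    src = word_ngrams(source, n)
    if not susp:
        return 0.0
    return len(susp & src) / len(susp)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists,
    computed with the standard dynamic program using two rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(suspicious, source):
    """LCS length divided by the length of the shorter document
    (assumed normalization)."""
    a = suspicious.lower().split()
    b = source.lower().split()
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return lcs_length(a, b) / shorter
```

Under these assumptions, a suspicious document whose word unigrams all appear in the source scores 1.0 on unigram containment, and the score drops as paraphrasing replaces more of the vocabulary, which matches the ordering of the mean scores reported in the abstract.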

Publisher

Hindawi Limited

Subject

Computer Science Applications, Software

Link

http://downloads.hindawi.com/journals/sp/2019/2962040.pdf

References: 21 articles.

1. Methods for cross-language plagiarism detection

2. Arabic-English Cross-language Plagiarism Detection using Winnowing Algorithm

Cited by 15 articles.

1. A Study of Business English Translation Skills Based on Parallel Corpus; Applied Mathematics and Nonlinear Sciences; 2024-01-01

2. Enhancing Urdu Intrinsic Plagiarism Detection Through Stylometry Features and Machine Learning; 2023 25th International Multitopic Conference (INMIC); 2023-11-17

3. Cross-lingual Text Reuse Detection at Document Level for English-Urdu Language Pair; ACM Transactions on Asian and Low-Resource Language Information Processing; 2023-06-16

4. Urdu paraphrase detection: A novel DNN-based implementation using a semi-automatically generated corpus; Natural Language Engineering; 2023-05-29

5. Siamese-Based Architecture for Cross-Lingual Plagiarism Detection in English–Hindi Language Pairs; Big Data; 2023-02-01