Affiliation:
1. Riga Technical University, Riga, Latvia
Abstract
Plagiarism detection is generally divided into two main stages: source retrieval and text alignment. This paper evaluates and compares the effectiveness of five fingerprint selection algorithms used during the source retrieval stage: Every p-th, 0 mod p, Winnowing, Frequency-biased Winnowing (FBW), and Modified FBW (MFBW). The algorithms are evaluated on a dataset containing plagiarism cases in Bachelor's and Master's theses written in English in the field of computer science. The best performance is achieved by 0 mod p, Winnowing, and MFBW. For these algorithms, reducing the fingerprint size from 100 % to about 20 % kept effectiveness at approximately the same level. Moreover, MFBW sends fewer document pairs overall to the text alignment stage, thereby also reducing the computational cost of the process. The software developed for this study is freely available at the author’s website, http://www.cs.rtu.lv/jekabsons/.
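To make the idea of fingerprint selection concrete, the following is a minimal sketch of three of the named algorithms (Every p-th, 0 mod p, and Winnowing) in their commonly described textbook form. It is not the software from the study. The function names, the use of word k-grams, the MD5-based hash, and the parameters k (k-gram length) and w (window size) are assumptions made for illustration; only p comes from the algorithm names. FBW and MFBW are omitted because the abstract does not describe them.

```python
import hashlib


def kgram_hashes(text, k=5):
    """Hash every overlapping k-gram of words (assumed chunking scheme)."""
    words = text.lower().split()
    hashes = []
    for i in range(len(words) - k + 1):
        gram = " ".join(words[i:i + k])
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        # Use the first 4 bytes of the digest as a 32-bit integer hash.
        hashes.append(int.from_bytes(digest[:4], "big"))
    return hashes


def every_pth(hashes, p):
    """Every p-th: keep the hashes at positions 0, p, 2p, ..."""
    return set(hashes[::p])


def zero_mod_p(hashes, p):
    """0 mod p: keep the hashes whose value is divisible by p."""
    return {h for h in hashes if h % p == 0}


def winnowing(hashes, w):
    """Winnowing: keep the minimum hash of every window of w hashes."""
    if not hashes:
        return set()
    if len(hashes) < w:
        # Sequence shorter than one window: treat it as a single window.
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}
```

In such a scheme, two documents would typically be passed on to text alignment when their selected fingerprints share enough hashes. Note that Every p-th depends on position, so shifting the text changes the selection, whereas 0 mod p and Winnowing select by hash value and are therefore less sensitive to insertions and deletions.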