1. Abadji, J., Suarez, P.O., Romary, L., Sagot, B.: Towards a cleaner document-oriented multilingual crawled corpus. In: Thirteenth Language Resources and Evaluation Conference-LREC 2022 (2022)
2. Baroni, M., Bernardini, S.: BootCaT: bootstrapping corpora and terms from the web. In: LREC, pp. 1313–1316 (2004)
3. Baroni, M., Bernardini, S., Ferraresi, A., Zanchetta, E.: The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Lang. Resour. Eval. 43, 209–226 (2009)
4. Benko, V.: Data deduplication in Slovak corpora. In: Slovko 2013: Natural Language Processing, Corpus Linguistics, E-learning, pp. 27–39. RAM-Verlag, Lüdenscheid (2013)
5. Benko, V.: In: Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence) (2014)