1. Julien Abadji, Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. 2021. Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus. In Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-9) 2021. Limerick, 12 July 2021 (Online-Event). Leibniz-Institut für Deutsche Sprache, 1--9.
2. Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, and Benoît Sagot. 2022. Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. (Jan. 2022). arXiv:2201.06642 [cs.CL]
3. Main Content Extraction from Heterogeneous Webpages
4. Julián Alarte et al. 2019. What Web Template Extractor Should I Use? A Benchmarking and Comparison for Five Template Extractors. ACM Trans. Web (2019).
5. Martin Armstrong. 2021. Infographic: How many websites are there? https://web.archive.org/web/20230131222529/https://www.statista.com/chart/19058/number-of-websites-online/ Captured: 31 Jan, 2023.