1. Amazon. [n. d.]. Amazon Web Services (AWS) Open Data Sponsorship Program. Web document. https://aws.amazon.com/opendata/open-data-sponsorship-program/ Retrieved: 22 November 2023.
2. Apache Software Foundation. [n. d.]. Content analysis toolkit. Web document. https://tika.apache.org/ Retrieved: 7 December 2023.
3. Internet Archive. [n. d.]. Sort-friendly URI Reordering Transform. Web document. http://crawler.archive.org/articles/user_manual/glossary.html#surt Retrieved: 29 November 2022.
4. Stefan Baack and Mozilla Insights. 2024. Training Data for the Price of a Sandwich. Web document. https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl/ Retrieved: 25 February 2024.
5. An Empirical Study of the Use of Integrity Verification Mechanisms for Web Subresources