by Sysreq2 555 days ago
You could also consider using the Common Crawl dataset, which is hosted on AWS through its open data registry. Archive.org is more or less a wrapper around it anyway.

https://registry.opendata.aws/commoncrawl/
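
If it helps, here is a rough sketch of how you might look up pages in Common Crawl before downloading anything. It only builds the query URL for the public URL index at index.commoncrawl.org and makes no network call. The function name and the crawl ID "CC-MAIN-2024-10" are just placeholders I picked for illustration; check the index site for the list of crawls you actually want.

```python
from urllib.parse import urlencode


def build_index_query(url_pattern: str, crawl_id: str = "CC-MAIN-2024-10") -> str:
    """Build a query URL for the Common Crawl URL index.

    Assumptions (not from the comment above): the crawl_id default is only
    an example label, and build_index_query is a hypothetical helper name.
    """
    # "url" is the page or wildcard pattern to look up; "output=json"
    # asks the index server for one JSON record per line.
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"


if __name__ == "__main__":
    # Print the query URL for every captured page under example.com.
    print(build_index_query("example.com/*"))
```

Each record the index returns should point at a WARC file plus a byte offset and length, so you can fetch just the pages you need instead of pulling whole archives.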