Interesting, but 1.3 million pages is somewhat limited. They seem to have done a good job indexing Wikipedia. I'm curious, why not scan the full IPv4 address space and index the main page of every website you find?
You won't be able to reach most websites this way, because most servers expect you to also pass a valid hostname (many sites share a single IP address, and the server uses the hostname to decide which one to serve). However, you can use domain lists as an initial seed, for example https://purecrawl.com/en/download/domains (or https://domains-monitor.com/, which is paid but has more domains). That shouldn't be too bad, but you'll have to ingest terabytes of spammy, low-quality content.
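To make the hostname point concrete, here is a rough Python sketch. It assumes the domain list is a plain text file with one domain per line (I haven't checked the actual format of either download), and the helper names `seed_urls` and `build_request` are made up for illustration. The first turns list entries into seed URLs; the second shows the raw request a crawler would send, where the `Host` header is what lets a shared server pick the right site.

```python
def seed_urls(lines):
    """Turn raw domain-list lines into deduplicated https:// seed URLs.

    Assumes one domain per line; blank lines and '#' comments are skipped.
    """
    seen = set()
    urls = []
    for line in lines:
        # Normalize case and drop a trailing root dot ("example.org.").
        domain = line.strip().lower().rstrip(".")
        if not domain or domain.startswith("#"):
            continue
        if domain in seen:
            continue
        seen.add(domain)
        urls.append(f"https://{domain}/")
    return urls


def build_request(hostname, path="/"):
    """Build a raw HTTP/1.1 GET request for the given hostname.

    The Host header carries the site name; connecting to a bare IP
    gives the server nothing to select a virtual host with.
    """
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {hostname}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


if __name__ == "__main__":
    sample = ["Example.com\n", "", "# comment", "example.com", "example.org."]
    for url in seed_urls(sample):
        print(url)
    print(build_request("example.com").decode("ascii"))
```

For HTTPS the same idea applies one layer down: the hostname also goes into the TLS handshake (SNI), so a scan by IP alone usually gets a default certificate or a default page rather than the real site.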