| HN Mirror



	by arnaudsm 663 days ago
	I don't buy this number. Text-only Common Crawl is 20TB. Remove spam and dupes, and you're left with under 10TB of current useful data, which you can parse and index on a single server nowadays. It's the full Google index history with full HTML that is probably 12PB, but the useful part of the search engine is much smaller.
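	To make the "remove dupes" step concrete, here is a minimal sketch of exact-duplicate removal by content hash. The names `normalize` and `dedupe` are made up for illustration, and this is not how Common Crawl or any particular search engine does it. A real pipeline would also need near-duplicate detection and spam filtering, which this doesn't attempt.

```python
import hashlib


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies hash the same.
    return " ".join(text.lower().split())


def dedupe(docs):
    """Yield each document whose normalized text has not been seen before."""
    seen = set()  # digests of normalized documents already emitted
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc


# Hypothetical sample input: the first two pages differ only in case and spacing.
pages = ["Hello  World", "hello world", "Something else"]
unique = list(dedupe(pages))
```

	Storing only a 20-byte digest per document keeps the seen-set small compared to the text itself, which is part of why this kind of pass seems feasible on one machine.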

1 comments

Does CC publish the methodology for how they determine what to crawl? More particularly, how do they determine what not to crawl?

Yes, a few big sites are missing, notably Reddit. Most of CC is spam though; the really useful content is quite small.

I'm experimenting with my own search engine at the moment, and am considering making it public at some point. It's not that impossible of a task!