by adamcharnock 666 days ago
> The Google Search Index is a unique and irreplaceable resource within the digital ecosystem. Mandating fair access to it or treating it as an essential facility could address the core issues...

The article estimates the Google Search Index at 12.5PB. If Kagi thinks that is a big enough moat to be the primary target then, well, I suppose they should know. But I'm also skeptical. You could fit that on about 50 Hetzner SX295 servers, so about $20k/month, plus the cost of gathering the data. It is surely a huge resource.

But weighed against the combination of Google Search + AdWords + Android + YouTube + Chrome, all in a single company? To me a 12.5PB search index feels like small change in comparison.

NB: Happy Kagi-paying customer here.
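
For anyone who wants to check the back-of-envelope storage figure above, here is a minimal sketch. It only uses the numbers from the comment (12.5PB, about 50 servers, about $20k/month); the function name is made up for illustration, and it assumes decimal units and no replication or redundancy overhead.

```python
def storage_estimate(index_pb: float, servers: int, monthly_usd: float) -> dict:
    """Derive per-server and per-TB figures from a rough storage quote.

    Assumes decimal units (1 PB = 1000 TB) and ignores replication,
    bandwidth, and the cost of actually gathering the data.
    """
    index_tb = index_pb * 1000
    return {
        "tb_per_server": index_tb / servers,
        "usd_per_server_month": monthly_usd / servers,
        "usd_per_tb_month": monthly_usd / index_tb,
    }


# Figures quoted in the comment: 12.5 PB on ~50 servers for ~$20k/month.
est = storage_estimate(index_pb=12.5, servers=50, monthly_usd=20_000)
for name, value in est.items():
    print(f"{name}: {value:.2f}")
```

So the estimate implies roughly 250TB and $400 per server per month. Whether a given server model actually matches that is not something this sketch verifies.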

4 comments

> The article estimates the Google Search Index at 12.5PB.

I realize there was a mistake in the estimated number (thanks for pointing it out; it should be closer to 180 PB for raw crawl data). The figure is speculative, and it does not account for the other data needed to actually rank pages or for the hardware needed to do that in under 500ms at a scale of billions of queries per day, so it can be misleading about the true effort involved. I have edited that datapoint out of the article.

You are right, just crawling a large number of pages (millions, even billions) is indeed straightforward (e.g. [1]). The hard part is creating a searchable, web-scale index of a certain quality level, which is simply impossible to do anymore, for many reasons that would require another article to explain. Microsoft has spent $100bn and the last 20 years, by their own account, trying to match it, and most people agree it is still not even close. At some point you reach diminishing returns. To use the analogy from the article, it is akin to someone trying to rebuild the entire US railroad network today. It sounds plausible, but it is not really feasible in practice. That train left the station in the early 2000s.

[1] https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-billio...
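
To put the serving side in perspective (under 500ms, billions of queries per day), here is a rough sketch of the implied load. The assumptions are mine, not from the article: exactly 1 billion queries per day as a lower bound for "billions", load spread evenly over the day (real peaks would be higher), and every query taking the full 500ms. The in-flight figure comes from Little's law (arrival rate times time in system).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def serving_load(queries_per_day: float, latency_s: float) -> tuple[float, float]:
    """Return (average queries per second, average queries in flight)."""
    qps = queries_per_day / SECONDS_PER_DAY
    # Little's law: items in the system = arrival rate * time in system.
    in_flight = qps * latency_s
    return qps, in_flight


# Assumed lower bound: 1 billion queries/day, each taking the full 500 ms.
qps, in_flight = serving_load(queries_per_day=1e9, latency_s=0.5)
print(f"{qps:,.0f} queries/s, about {in_flight:,.0f} in flight on average")
```

Even at this lower bound that is over ten thousand queries per second, each needing to be ranked against the whole index, which is a different problem from storing the crawl.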

Thank you for the reply!

> it is about creating a searchable index of the web scale that has certain quality level that is simply impossible to do anymore for many reasons that would require another article to explain

I am both happy to take your word for it, and also very interested to know more. If you were to write that article then I would love to read it.

Aye, such a follow-up article would greatly help bolster the case being made here, which I’m fully on board with.
This puts it in enough perspective for me to ask: why doesn’t a university create a public/open source search index? Seems like a way to get a ton of attention.

Moreover, archive.org has all the data and data storage capabilities many times over. What prevents them from creating an open source search engine?

Or the Library of Congress, if it had the right appropriations. Google itself started with an NSF grant to explore the future of libraries.
I don't buy this number. Text-only Common Crawl is 20TB. Remove spam and dupes, and you're at under 10TB of current useful data, which you can parse and index on a single server nowadays.

It's the full Google index history with full HTML that is probably 12PB, but the useful part of the search engine is much smaller.
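
As an illustration of the "remove dupes" step mentioned above, here is a minimal sketch of exact-duplicate removal by hashing normalized text. This is an assumed approach for illustration only, not how the commenter or Common Crawl does it; it does not catch near-duplicates or spam, and `dedupe` is a hypothetical helper name.

```python
import hashlib
from typing import Iterable


def dedupe(docs: Iterable[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized text."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in docs:
        # Normalize case and whitespace so trivial variations hash the same.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


pages = ["Hello  World", "hello world", "Something else"]
print(dedupe(pages))
```

Storing only the digests keeps memory use proportional to the number of distinct documents rather than their total size.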

Does CC publish the methodology for how they determine what to crawl? More particularly, how do they determine what not to crawl?
Yes, a few big sites are missing, notably Reddit. Most of CC is spam though; the really useful content is quite small.

I'm experimenting with my own search engine at the moment, and am considering making it public at some point. It's not that impossible of a task!

I assume that the major hurdle is not storing an equivalently sized search index, but building one from scratch. Crawling takes time, and Google has had a head start of many years.
Yes, OP is hilariously out of touch; the storage would be under 1% of total costs.