| HN Mirror


by adamcharnock 666 days ago

> The Google Search Index is a unique and irreplaceable resource within the digital ecosystem. Mandating fair access to it or treating it as an essential facility could address the core issues...

The article estimates the Google Search Index at 12.5PB. If Kagi thinks that is a big enough moat to be the primary target then, well, I suppose they should know. But I'm also skeptical. You could fit that on about 50 Hetzner SX295 servers, so about $20k/month, plus the cost of gathering the data. It is surely a huge resource.

But weighed against the combination of Google Search + AdWords + Android + YouTube + Chrome, all in a single company? To me a 12.5PB search index feels like small change in comparison.

NB: Happy Kagi-paying customer here.

4 comments

freediver 666 days ago

> The article estimates the Google Search Index at 12.5PB.

I realize there was a mistake in the estimated number (thanks for pointing it out; it should be closer to 180 PB for raw crawl data). The figure is speculative, and it does not account for the other data needed to actually rank pages or for the hardware needed to do so in under 500ms at a scale of billions of queries per day. It can therefore be misleading about the true effort involved, so I edited that datapoint out of the article.

You are right, just crawling a large number of pages (millions, even billions) is indeed straightforward (e.g. [1]); it is about creating a searchable index of the web scale that has certain quality level that is simply impossible to do anymore for many reasons that would require another article to explain. Microsoft, by their own account, spent $100bn and the last 20 years trying to match it, and most people agree it is still not even close. At some point you reach diminishing returns. To use the analogy from the article, it is akin to someone trying to rebuild the entire US railroad network today. It sounds plausible, but it is not really feasible in practice. That train left the station in the early 2000s.

[1] https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-billio...
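The back-of-envelope storage figures in this exchange can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: the per-server capacity (250 TB) and price ($400/month) are not vendor specifications but are simply what the "about 50 servers, about $20k/month" estimate for 12.5 PB implies, and the function names are made up for this example. It covers raw storage only, not crawling, ranking data, or query-serving hardware.

```python
import math

PB_IN_TB = 1000  # decimal units, as storage capacity is usually quoted


def servers_needed(index_pb: float, usable_tb_per_server: float) -> int:
    """Number of whole servers required to hold index_pb petabytes."""
    return math.ceil(index_pb * PB_IN_TB / usable_tb_per_server)


def monthly_cost(servers: int, usd_per_server_month: float) -> float:
    """Total monthly rental cost for the given number of servers."""
    return servers * usd_per_server_month


# Assumptions implied by the thread: 12.5 PB on ~50 servers for ~$20k/month.
USABLE_TB_PER_SERVER = 12.5 * PB_IN_TB / 50  # 250 TB per server (assumed)
USD_PER_SERVER_MONTH = 20_000 / 50           # $400 per server (assumed)

for label, size_pb in [("original estimate", 12.5), ("raw crawl estimate", 180.0)]:
    n = servers_needed(size_pb, USABLE_TB_PER_SERVER)
    cost = monthly_cost(n, USD_PER_SERVER_MONTH)
    print(f"{label}: {size_pb} PB -> {n} servers, ${cost:,.0f}/month")
```

Under these assumptions the 180 PB raw-crawl figure works out to 720 servers and $288,000/month, roughly 14 times the original estimate, and still only for storage.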


adamcharnock 666 days ago

Thank you for the reply!

> it is about creating a searchable index of the web scale that has certain quality level that is simply impossible to do anymore for many reasons that would require another article to explain

I am both happy to take your word for it, and also very interested to know more. If you were to write that article then I would love to read it.


erlend_sh 665 days ago

Aye, such a follow-up article would greatly help bolster the case being made here, which I’m fully on board with.


IgorPartola 666 days ago

This puts it in enough perspective for me to ask: why doesn’t a university create a public/open source search index? Seems like a way to get a ton of attention.

Moreover, archive.org has all the data and data storage capabilities many times over. What prevents them from creating an open source search engine?


scroot 666 days ago

Or the Library of Congress, if it had the right appropriations. Google itself started with an NSF grant to explore the future of libraries.


arnaudsm 666 days ago

I don't buy this number. Text-only Common Crawl is 20TB. Remove spam and dupes and you're at under 10TB of current useful data, which you can parse and index on a single server nowadays.

It's the full Google index history with full HTML that is probably 12PB, but the useful part of the search engine is much smaller.


1vuio0pswjnm7 666 days ago

Does CC publish the methodology for how they determine what to crawl? More particularly, how do they determine what not to crawl?


arnaudsm 666 days ago

Yes, a few big sites are missing, notably Reddit. Most of CC is spam though; the really useful content is small.

I'm experimenting with my own search engine at the moment, and am considering making it public at some point. It's not that impossible of a task!


dmonitor 666 days ago

I assume that the major hurdle is not storing an equivalently sized search index, but building one from scratch. Crawling takes time, and Google has had a head start of many years.


pierrefermat1 666 days ago

Yes, OP is hilariously out of touch; the storage would be under 1% of total costs.
