

by freediver 661 days ago

> The article estimates the Google Search Index at 12.5PB.

I realize there was a mistake in the estimated number (thanks for pointing it out; it should be closer to 180 PB for raw crawl data). The figure is speculative, and it does not account for the other data needed to actually rank pages or for the hardware needed to do it in under 500 ms at a scale of billions of queries per day, so it can be misleading about the true effort involved. I edited that data point out of the article.

You are right, just crawling a large number of pages (millions, even billions) is indeed straightforward (e.g. [1]); it is about creating a searchable index of the web scale that has certain quality level that is simply impossible to do anymore for many reasons that would require another article to explain. By their own account, Microsoft spent $100bn and the last 20 years trying to match it, and most people agree it is still not even close. At some point you reach diminishing returns. To use the analogy from the article, it is akin to someone trying to rebuild the entire US railroad network today. It sounds plausible, but it is not really feasible in practice. That train left the station in the early 2000s.

[1] https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-billio...


adamcharnock 660 days ago

Thank you for the reply!

> it is about creating a searchable index of the web scale that has certain quality level that is simply impossible to do anymore for many reasons that would require another article to explain

I am both happy to take your word for it, and also very interested to know more. If you were to write that article then I would love to read it.


erlend_sh 660 days ago

Aye, such a follow-up article would greatly help bolster the case being made here, which I’m fully on board with.
