by adamcharnock
666 days ago
> The Google Search Index is a unique and irreplaceable resource within the digital ecosystem. Mandating fair access to it or treating it as an essential facility could address the core issues...

The article estimates the Google Search Index at 12.5PB. If Kagi thinks that is a big enough moat to be the primary target then, well, I suppose they should know. But I'm also skeptical.

You could fit that on about 50 Hetzner SX295, so about $20k/month. Plus the cost of gathering the data.

It is surely a huge resource. But weighed against the combination of Google Search + AdWords + Android + YouTube + Chrome, all in a single company? To me a 12.5PB search index feels like small change in comparison.

NB: Happy Kagi-paying customer here.
I realize there was a mistake in the estimated number (thanks for pointing it out; it should be closer to 180 PB for raw crawl data). Since the figure is speculative and does not account for the other data needed to actually rank pages, or the hardware needed to do so in under 500ms at a scale of billions of queries per day, it can be misleading about the true effort involved, so I edited that data point out of the article.
You are right that just crawling a large number of pages (millions, even billions) is indeed straightforward (e.g. [1]). The hard part is creating a web-scale searchable index of a certain quality level, which is simply impossible to do anymore, for many reasons that would require another article to explain. Microsoft has, by their own account, spent $100bn and the last 20 years trying to match it, and most people agree it is still not even close. At some point you reach diminishing returns. To use the analogy from the article, it is akin to someone trying to rebuild the entire US railroad network today. It sounds plausible, but is not really feasible in practice. That train left the station in the early 2000s.
[1] https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-billio...
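
For anyone who wants to redo the storage arithmetic from this thread, here is a rough sketch. The per-server capacity and price are simply the values implied by the "about 50 servers, about $20k/month for 12.5PB" figure above; they are assumptions, not Hetzner's published specs or pricing, and `storage_estimate` is a hypothetical helper. It covers raw storage only and ignores replication, ranking data, crawling cost, and the hardware needed to serve queries quickly.

```python
import math

# Figures quoted in the thread (not independently verified).
INDEX_PB = 12.5        # index size estimate from the original comment
SERVERS = 50           # "about 50 Hetzner SX295"
MONTHLY_USD = 20_000   # "about $20k/month"

# Per-server values implied by those figures (assumptions, not vendor specs).
pb_per_server = INDEX_PB / SERVERS       # 0.25 PB per server
usd_per_server = MONTHLY_USD / SERVERS   # 400 USD per server per month


def storage_estimate(total_pb):
    """Return (server_count, monthly_usd) for raw storage of total_pb petabytes."""
    servers = math.ceil(total_pb / pb_per_server)
    return servers, servers * usd_per_server


small = storage_estimate(12.5)   # the original 12.5PB estimate
large = storage_estimate(180)    # the corrected ~180 PB raw crawl figure
print(small, large)
```

Under these assumptions the corrected 180 PB figure comes to roughly 720 servers and about $288k/month for storage alone, which is still only the cheapest part of the problem described in the reply.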