

by zepearl 672 days ago

> I would presume Google still has all this data. ...

Maybe. I guess they must have served that "cached" content from DB records that stored it in full (URL X has contents Y, so basically a "mirror" of the content they indexed). Not having to store that "mirror" (only the search index) might save quite a lot of storage space, plus the I/O and CPU needed to decompress it, since users won't be requesting it anymore. All in all, that might save quite a lot of infrastructure costs $$$.

> Could this be an advantage that Google can use to train their models on but others won't have access?

Maybe (if they decided to just get rid of the I/O related to user requests and still keep the data internally), but on the other hand I don't know whether any Google "consumer" was ever able to mass-download Google's "cached" data in the first place. Could that be done without being banned by Google's website (or API)?