| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by structural 483 days ago

One major difference is that while indexing, you're generating an internal data structure that represents that site. Once done, if the site doesn't change, you don't have any need to revisit it, and in fact, fetching the site multiple times just increases your own costs.

On the other hand, an unsupervised AI training algorithm may just need raw text, and as much of it as possible. It doesn't know what site it came from or much care, and it's not building any index that links the content back to its original source. So fetching the site on each training epoch might actually be viable: why bother storing the entire internet when you can just fetch -> transform -> ingest into your model? If your crawler is distributed enough, it won't be the bottleneck, either.

If this is the architecture some companies are using, this also means that these crawlers won't ever stop, because they are finetuning some model by constantly updating over time based on the "current" internet, whatever that might mean.