by myzie
453 days ago
An aspect I find interesting is that these crawlers are all doing highly redundant work. Thousands of crawlers are running around the world, and each one may visit the same site and the same pages multiple times a week.

This seems like an opportunity for a company like Firecrawl, ScrapingBee, etc. to offer built-in caching with TTLs, so that redundant requests hit the cache and don't add load on the actual site. Even if each company that operates a crawler only cached pages across its own runs, I'd expect a large improvement in the situation. For more dynamic pages this obviously doesn't help, but a lot of the web's content is fairly static and is being crawled thousands of times.

I built something for my own company that crawls using Playwright and caches in S3/Postgres with a TTL for this purpose. Does this make sense to anyone else? I'm not sure if I'm missing something that makes this harder than it seems on the surface. (Actual question!)
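To make the idea concrete, here is a minimal sketch of a TTL cache sitting in front of a page fetcher. It is an illustration only, not the actual Playwright plus S3/Postgres setup: the `TTLPageCache` class, the `fetch` callable, and the in-memory dict standing in for real storage are all hypothetical names chosen for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CachedPage:
    body: str          # page content as returned by the fetcher
    fetched_at: float  # clock reading at the time of the origin fetch


class TTLPageCache:
    """Serve pages from a cache until they are older than ttl_seconds.

    `fetch` is any callable that takes a URL and returns the page body
    (hypothetical stand-in for a real crawler). The dict is a stand-in
    for durable storage such as an object store or a database.
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, CachedPage] = {}
        self.origin_hits = 0  # number of requests that reached the site

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._store.get(url)
        # Fresh entry: answer from the cache, the site sees no request.
        if entry is not None and now - entry.fetched_at < self._ttl:
            return entry.body
        # Missing or expired entry: fetch from the site and store it.
        body = self._fetch(url)
        self.origin_hits += 1
        self._store[url] = CachedPage(body=body, fetched_at=now)
        return body
```

With something like `TTLPageCache(fetch=my_fetcher, ttl_seconds=86400)`, repeated `get()` calls for the same URL within a day would reach the site once, which is the load reduction described above. Picking the TTL per site or per page type is the hard part and is not addressed here.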
|
They have the incentive, it is relatively easy, and I don't think there's a huge benefit to centralisation (especially since it would basically end up centralised with one of the big caching providers anyway).