by myzie 453 days ago
An aspect I find interesting is that these crawlers are all doing highly redundant work. As in, thousands of crawlers are running around the world, and each crawler may visit the same site and pages multiple times a week.

This seems like an opportunity for a company like Firecrawl, ScrapingBee, etc. to offer built-in caching with TTLs so that redundant requests can hit the cache and not contribute to load on the actual site.

Even if each company that operates a crawler cached pages across multiple runs, I'd expect a large improvement in the situation.

For more dynamic pages, this obviously doesn't help. But a lot of the web's content is more static and is being crawled thousands of times.

I built something for my own company that crawls using Playwright and caches in S3/Postgres with a TTL for this purpose.
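To make the idea concrete, here is a minimal sketch of TTL-based page caching. It is not the setup described above: an in-memory dict stands in for S3/Postgres, and the `fetch` callable is a placeholder for whatever actually loads the page (Playwright or otherwise). The names `TTLPageCache`, `fetch`, and `origin_hits` are made up for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TTLPageCache:
    """Cache fetched pages by URL and reuse them until the TTL expires.

    Assumption: the store is a plain dict; a real crawler would likely
    persist entries somewhere durable (the comment mentions S3/Postgres).
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # hypothetical page loader, injected by the caller
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, str]] = {}
        self.origin_hits = 0         # how many requests actually reached the site

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None:
            fetched_at, body = entry
            if now - fetched_at < self._ttl:
                # Still fresh: serve from cache, no load on the origin site.
                return body
        # Missing or expired: fetch from the origin and record the time.
        body = self._fetch(url)
        self.origin_hits += 1
        self._store[url] = (now, body)
        return body
```

With a TTL of, say, a day, repeated crawls of the same mostly-static page within that window would reach the site once instead of every run. Choosing the TTL per site or per content type is the hard part, and this sketch does not attempt it.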

Does this make sense to anyone else? I'm not sure if I'm missing something that makes this harder than it seems on the surface. (Actual question!)

2 comments

I have considered this before, but then if the content can be cached why wouldn't the website just do this themselves?

They have the incentive, it is relatively easy, and I don't think there's a huge benefit to centralisation (especially since it will basically be centralised to one of the big providers of caching anyway).

I'm definitely with you that sites should be leveraging CDNs and similar. But I get that many don't want to do any work to support bots that they don't want to exist in the first place.

To me it seems like the companies actually doing the crawling have an incentive to leverage centralized caching. It makes their own crawling faster (since hitting the cache is much faster than using Playwright etc. to load the page) and it reduces the impact on all these sites, which would then also decrease the impact of this whole bot situation overall.

It would shift the complexity and cost of large scale caching to a provider that would sell to the scrapers. Not sure it has much value, but it’s kind of a classic three tier distribution system with a middleman to make life easier for both producer and consumer.
What does the user agent look like if you wanted to crawl xeiaso.net?