by matja 122 days ago
Did you try adding a Cache-Control response header?

Even if they haven't added any cache control headers, what kind of lazy Meta engineer designed their crawler to just pull the same URL multiple times a second?

Is this where all that hardware for AI projects is going? To data centers that just uncritically hit the same URL over and over, without checking whether the content of a site or page has changed since the last visit and then calculating a proper retry interval. Search engine crawlers 25 - 30 years ago could do this.

Hit the URL once per day; if it changes daily, try twice a day. If it hasn't changed in a week, maybe only retry twice per week.
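
To make that concrete, here is a rough sketch of such a revisit policy. The halve-on-change / double-on-no-change rule and the names `next_interval`, `MIN_INTERVAL` and `MAX_INTERVAL` are my own assumptions for illustration, not how any real crawler is known to work; the bounds just mirror the "twice a day" and "twice per week" figures above.

```python
from datetime import timedelta

# Hypothetical bounds taken from the comment above.
MIN_INTERVAL = timedelta(hours=12)   # at most twice a day
MAX_INTERVAL = timedelta(days=3.5)   # at least twice per week


def next_interval(current: timedelta, changed: bool) -> timedelta:
    """Halve the wait if the page changed, double it if it did not.

    The result is clamped to [MIN_INTERVAL, MAX_INTERVAL].
    """
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


if __name__ == "__main__":
    interval = timedelta(days=1)  # start at once per day
    # Page was unchanged on the last visit, so back off.
    interval = next_interval(interval, changed=False)
    print(interval)
```

Starting from once per day, an unchanged page backs off to every two days and then caps at twice per week, while a page that keeps changing bottoms out at twice a day.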

It's not the "same" crawler. Probably each thread or each cluster machine instance of the crawler hitting it independently.
That's still the same crawler system though. And it's lazy engineering not to build in something to track when you last requested a URL.

And it's quite a trivial feature at that.
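
For a sense of how small it is, here is a minimal sketch for threads inside one process. `FetchLog` and `try_claim` are hypothetical names, and a real multi-machine crawler would presumably need a shared store instead of an in-memory dict; that part is not shown.

```python
import threading
import time
from typing import Dict, Optional


class FetchLog:
    """Remembers when each URL was last claimed for fetching."""

    def __init__(self, min_gap_seconds: float) -> None:
        self._min_gap = min_gap_seconds
        self._last: Dict[str, float] = {}
        self._lock = threading.Lock()

    def try_claim(self, url: str, now: Optional[float] = None) -> bool:
        """Return True if the caller may fetch `url` now, else False.

        `now` is injectable for testing; it defaults to a monotonic clock.
        """
        if now is None:
            now = time.monotonic()
        with self._lock:
            last = self._last.get(url)
            if last is not None and now - last < self._min_gap:
                return False  # someone fetched it too recently
            self._last[url] = now
            return True
```

Every worker calls `try_claim(url)` before requesting and skips the URL when it returns False, so only one of them hits a given URL within the gap.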

I sincerely doubt that search engines run their crawlers on a single machine, and they got it figured out.
Forgejo does set "cache-control: private, max-age=21600", which is considerably more than one second, but I grant it uses the "private" keyword for no reason here.
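
For reference, 21600 seconds is 6 hours. A crawler that wanted to treat that value as a revisit hint could pull it out of the header with something like the sketch below. `max_age_seconds` is a hypothetical helper, and this is a simplified parse (it ignores quoted strings containing commas), not a full implementation of the header grammar.

```python
from typing import Optional


def max_age_seconds(cache_control: str) -> Optional[int]:
    """Return the max-age value from a Cache-Control header, if present."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        value = value.strip().strip('"')
        if name.strip().lower() == "max-age" and value.isdigit():
            return int(value)
    return None


if __name__ == "__main__":
    age = max_age_seconds("private, max-age=21600")
    print(age, "seconds")
```

Whether a crawler should honor `max-age` on a `private` response at all is a judgment call; the sketch only shows that the number is easy to read.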