| HN Mirror



	by matja 122 days ago
	Did you try adding a Cache-Control response header?

2 comments

mrweasel 122 days ago

Even if they haven't added any cache control headers, what kind of lazy Meta engineer designed their crawler to just pull the same URL multiple times a second?

Is this where all that hardware for AI projects is going? To data centers that just uncritically hit the same URL over and over, without checking whether the content of a site or page has changed since the last visit and then calculating a proper retry interval? Search engine crawlers 25-30 years ago could do this.

Hit the URL once per day; if it changes daily, try twice a day. If it hasn't changed in a week, maybe only retry twice per week.
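
As a rough sketch of that retry logic (not how any real crawler is known to do it): halve the revisit interval when the page changed since the last visit, double it when it did not, and clamp the result between "twice a day" and "twice per week". The names `next_interval`, `MIN_INTERVAL` and `MAX_INTERVAL` are made up for illustration, and the halve/double rule is just one assumed policy.

```python
from datetime import timedelta

# Bounds taken from the comment above; the exact values are an assumption.
MIN_INTERVAL = timedelta(hours=12)   # at most twice a day
MAX_INTERVAL = timedelta(days=3.5)   # at least twice per week


def next_interval(current: timedelta, changed: bool) -> timedelta:
    """Return the wait before the next visit to a URL.

    Halve the interval if the page changed since the last visit,
    double it if it did not, and keep the result within the bounds.
    """
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


# Start at once per day and feed in what each visit observed.
interval = timedelta(days=1)
for changed in (False, False, True):
    interval = next_interval(interval, changed)
```

How "changed" gets decided is left open here; comparing a hash of the body or using a conditional request would both work.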

bot403 122 days ago

It's not the "same" crawler. Probably each thread or each cluster machine instance of the crawler is hitting it independently.

OliverGuy 121 days ago

That's still the same crawler system though. And it's lazy engineering to not build in something to track when you last requested a URL.

And it's quite a trivial feature at that.
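
To illustrate how small the feature can be, here is a minimal sketch of a last-request tracker. `FetchLog` and `should_fetch` are hypothetical names, and the in-memory dict is an assumption: for threads or machines fetching independently, the same lookup would have to go through some shared store instead.

```python
import time


class FetchLog:
    """Remembers when each URL was last requested (illustrative only)."""

    def __init__(self, min_interval_seconds: float, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.last_fetch = {}        # url -> timestamp of the last request

    def should_fetch(self, url: str) -> bool:
        """Return True and record the request if the URL is due again."""
        now = self.clock()
        last = self.last_fetch.get(url)
        if last is not None and now - last < self.min_interval:
            return False            # requested too recently, skip it
        self.last_fetch[url] = now
        return True


# Usage with a fake clock: the second request within 60 s is refused.
fake_now = [0.0]
log = FetchLog(60, clock=lambda: fake_now[0])
first = log.should_fetch("https://example.com/page")
second = log.should_fetch("https://example.com/page")
fake_now[0] = 61.0
third = log.should_fetch("https://example.com/page")
```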

mrweasel 121 days ago

I sincerely doubt that search engines run their crawlers on a single machine, and they have it figured out.

Ndymium 121 days ago

Forgejo does set "cache-control: private, max-age=21600", which is considerably more than one second, but I grant it uses the "private" keyword for no reason here.
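
For reference, a small sketch of how a client could read that header value. `parse_cache_control` is a made-up helper and the parsing is deliberately simplified (it does not handle commas inside quoted strings). It shows that max-age=21600 works out to six hours, and that "private" is the directive telling shared caches not to store the response.

```python
def parse_cache_control(value: str) -> dict:
    """Split a Cache-Control value into {directive: argument or True}.

    Simplified: assumes no commas inside quoted arguments.
    """
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        # Directives without an argument (like "private") map to True.
        directives[name.strip().lower()] = arg.strip().strip('"') or True
    return directives


header = "private, max-age=21600"   # the value quoted in the comment above
directives = parse_cache_control(header)

max_age_hours = int(directives["max-age"]) / 3600
shared_cache_may_store = "private" not in directives
```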
