by mrweasel 121 days ago
Even if they haven't added any cache control headers, what kind of lazy Meta engineer designed their crawler to just pull the same URL multiple times a second?

Is this where all that hardware for AI projects is going? To data centers that just uncritically hit the same URL over and over, without checking whether the content of a site or page has changed since the last visit and then calculating a proper retry interval. Search engine crawlers 25 - 30 years ago could do this.

Hit the URL once per day; if it changes daily, try twice a day. If it hasn't changed in a week, maybe only retry twice per week.
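
To make that concrete, here is a rough sketch of such a revisit policy. It is only an illustration, not how any real crawler does it: PageState, record_fetch, the halving and 1.5x back-off factors, and the interval bounds are all assumptions chosen to match the "twice a day" to "twice per week" range above. Change detection here is a plain content hash.

```python
import hashlib
from dataclasses import dataclass

DAY = 24 * 60 * 60
MIN_INTERVAL = DAY / 2      # assumed floor: at most twice a day
MAX_INTERVAL = 3.5 * DAY    # assumed ceiling: roughly twice per week


@dataclass
class PageState:
    """Per-URL bookkeeping (hypothetical structure for this sketch)."""
    content_hash: str = ""
    interval: float = DAY   # start at once per day
    next_fetch: float = 0.0


def record_fetch(state: PageState, body: bytes, now: float) -> PageState:
    """Update the revisit interval after fetching `body` at time `now`."""
    digest = hashlib.sha256(body).hexdigest()
    if state.content_hash:
        if digest != state.content_hash:
            # Content changed since the last visit: come back sooner.
            state.interval = max(MIN_INTERVAL, state.interval / 2)
        else:
            # Nothing changed: back off, up to the ceiling.
            state.interval = min(MAX_INTERVAL, state.interval * 1.5)
    state.content_hash = digest
    state.next_fetch = now + state.interval
    return state
```

A scheduler would then skip any URL whose next_fetch is still in the future. A real crawler would probably also send conditional requests so unchanged pages cost almost nothing, but that is outside this sketch.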

1 comments

It's not the "same" crawler. Probably each thread or each machine in the cluster runs its own instance of the crawler and hits the URL independently.
That's still the same crawler system though. And it's lazy engineering to not build in something to track when you last requested a URL.

And it's quite a trivial feature at that.
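
As a minimal sketch of what that tracking could look like within one process (the LastRequested class and try_claim method are hypothetical names, and an in-memory dict is an assumption; a crawler spread over many machines would need the same check against some shared store instead):

```python
import threading
import time
from typing import Dict, Optional


class LastRequested:
    """Remembers when each URL was last requested, shared across threads."""

    def __init__(self, min_interval: float) -> None:
        self._min_interval = min_interval      # seconds between requests per URL
        self._last: Dict[str, float] = {}
        self._lock = threading.Lock()

    def try_claim(self, url: str, now: Optional[float] = None) -> bool:
        """Return True and record the time if `url` may be fetched now."""
        if now is None:
            now = time.time()
        with self._lock:
            last = self._last.get(url)
            if last is not None and now - last < self._min_interval:
                return False                   # requested too recently, skip
            self._last[url] = now
            return True
```

Each worker thread calls try_claim(url) before fetching and simply moves on when it returns False, so ten threads that pick up the same URL at the same moment produce one request instead of ten.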

I sincerely doubt that search engines run their crawlers on a single machine, and they have it figured out.