| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by kevincox 501 days ago
	It seems like common cases like this should be handled correctly by default. These are cachable requests intended for robots. Sure, it would be nice if webmasters configure it but I suspect a tiny minority does. For example even Cloudflare hasn't configure their official blog's RSS feed properly. My feed reader (running in a DigitalOcean datacenter) hasn't been able to access it since 2021 (403 every time even though backed off to checking weekly). This is a cachable endpoint with public data intended for robots. If they can't configure their own product correctly for their official blog how can they expect other sites to?

1 comments

progmetaldev 501 days ago

I agree, but I also somewhat understand. Some people will actually pay more per month for Cloudflare than their own hosting. The Cloudflare Pro plan is $20/month USD. Some sites wouldn't be able to handle the constant requests for robots.txt, just because bots don't necessarily respect cache headers (if they are even configured for robots.txt), and the sheer number of bots that look at robots.txt and will ignore a caching header are too numerous.

If you are writing some kind of malicious crawler that doesn't care about rate-limiting, and wants to scan as many sites as possible for the most vulnerable to get a list together to hack, you will scan robots.txt because that is the file that tells robots NOT to index these pages. I never use a robots.txt for some kind of security through obscurity. I've only ever bothered with robots.txt to make SEO easier when you can control a virtual subdirectory of a site, to block things like repeated content with alternative layouts (to avoid duplicate content issues), or to get a section of a website to drop out of SERPs for discontinued sections of a site.

link

kevincox 501 days ago

> sheer number of bots that look at robots.txt and will ignore a caching header

This is not relevant because Cloudflare will cache it so it never hits your origin. Unless they are adding random URL parameters (which you can teach Cloudflare to ignore but I don't think that should be a default configuration).

link

progmetaldev 501 days ago

The thing is, it won't do that by default. You have to enable caching currently, when creating a new account. I use a service that detects if a website is still running, and it does this by using a certain URL parameter to bypass the cache.

Again, I think you are correct with more sane defaults, but I don't know if you've ever dealt with a network admin or web administrator that hasn't dealt with server-side caching vs. browser caching, but it most definitely would end up with Cloudflare losing sales because people misunderstood how things work. Maybe I'm jaded, at 45, but I feel like most people don't even know to look at headers by default when they feel they hit a caching issue. I don't think it's based on age, I think it's based on being interested in the technology and wanting to learn all about it. Mostly developers that got into it for the love of technology, versus those that got into it because it was high paying and they understood Excel, or learned to build a simple website early in life, so everyone told them to get into software.

link