(I work at Vercel) While it's good our spend limits worked, it clearly was not obvious how to block or challenge AI crawlers¹ from our firewall (which it seems you manually found). We'll surface this better in the UI, and also have more bot protection features coming soon. Also glad our improved image optimization pricing² would have helped. Open to other feedback as well, thanks for sharing.
Hi, I'm the author of the blog (though I didn't post it on HN).
1) Our biggest issue right now is unidentified crawlers with user agents resembling regular users. We get hundreds of thousands of requests from those daily and I'm not sure how to block them on Vercel.
I'd love them to be challenged. If a bot doesn't identify itself, we don't want to let it in.
2) While we fixed the Image Optimization part and optimized caching, we're now struggling with ISR Write costs. We deploy often and the ISR cache is reset on each deploy.
We are about to put Cloudflare in front of the site, so that we can set Cache-Control headers and cache SSR pages (rather than using ISR) independently.
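For illustration, here is a minimal sketch of the kind of Cache-Control value an SSR response could send so that a CDN in front of the origin caches the page across deploys. The function name and the example durations are assumptions, not taken from the blog post. `s-maxage` and `stale-while-revalidate` are standard directives, but whether a given CDN honors each of them for HTML depends on its configuration.

```python
def cache_control(s_maxage: int, stale_while_revalidate: int = 0,
                  browser_max_age: int = 0) -> str:
    """Build a Cache-Control header value for a CDN-cacheable SSR page.

    Hypothetical helper for illustration only.
    s_maxage: seconds a shared cache (the CDN) may reuse the response.
    stale_while_revalidate: seconds a stale copy may be served while the
        cache refetches in the background (omitted when 0).
    browser_max_age: seconds the end user's browser may reuse the response.
    """
    parts = ["public", f"max-age={browser_max_age}", f"s-maxage={s_maxage}"]
    if stale_while_revalidate > 0:
        parts.append(f"stale-while-revalidate={stale_while_revalidate}")
    return ", ".join(parts)


# Example: CDN caches for 5 minutes, browsers always revalidate.
header_value = cache_control(300, stale_while_revalidate=60)
```

Keeping `max-age` at 0 while setting `s-maxage` means only the shared cache holds the page, so a purge at the CDN takes effect for everyone.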
I'm sure what you can share is limited, as I'm guessing this is cat and mouse. That being said, is there anything you can share about your implementation?
We’re working on a bot filtering system that blocks all non-browser traffic by default. Alongside that, we’re building a directory of verified bots, and you’ll be able to opt in to allow traffic only from those trusted sources. Hopefully shipping soon.
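Nothing in this thread says how the verified bot directory will check identities. One commonly documented approach for search crawlers is forward-confirmed reverse DNS: resolve the client IP to a hostname, check that the hostname belongs to the operator's domain, then resolve the hostname back and confirm it maps to the same IP. A rough sketch, where the function name, the suffix list and the injectable lookups are all assumptions made for illustration:

```python
import socket
from typing import Callable, Iterable, List, Optional


def is_verified_bot(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse_lookup: Optional[Callable[[str], str]] = None,
    forward_lookup: Optional[Callable[[str], List[str]]] = None,
) -> bool:
    """Forward-confirmed reverse DNS check (hypothetical helper).

    allowed_suffixes should start with a dot (e.g. ".googlebot.com") so that
    "evilgooglebot.com" does not match. The default lookups use the system
    resolver and only cover IPv4.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:  # no PTR record or resolver failure
        return False
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        # The claimed hostname must resolve back to the original IP.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

A user agent string alone proves nothing, which is why the check starts from the connecting IP rather than from any header the client sends.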
Verified bots? You mean the companies that got big by reading your info, so now you know who they are, but you won't allow any newcomers, so the ones that were taking the data all this time get rewarded by having their competition killed for them. lol.
You have it exactly right sans the reason to allow them in the first place. They're bots that provide reciprocal value to the site owner. Otherwise why even bother letting them through.
It's wild how people don't get that Facebook and Googlebot get let through paywalls and such because they bring the site real, tangible revenue. If you want the same privileges, you have to start with the monetary value you provide to the sites you index. Lead gen is hard, and major search engines provide crazy value for next to nothing.
Do they? AI bots provide me with nothing (best case scenario), or they serve my content in their pages without "read more" links, thus lowering my number of visitors.
Search bots, and especially Google, provide my site a lot of value. They respect the robots.txt, I can see that about half my visits come from search, and they identify properly as bots. It's almost impossible to notice a search bot in the graphs.
But AI bots suck. They don't even read the robots.txt, they hit the site as hard as it can take, when they receive a 5xx, a 444 or a 426 they interpret it as "keep requesting hard until you get a 200", they can easily DoS or bankrupt a small site, and they use fake user agents. As the OP's post shows, their activity can be clearly seen in the log graphs as huge spikes coming from a single client. OpenAI scanned 100% of one of my sites (more than 20,000 individual pages) in two days, causing intermittent DoS, while Google is at 80% of the sitemap.xml. And cherry on top, I still can't see a single visit in my logs that comes from their services.
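For contrast, the two behaviors described as missing here (honoring robots.txt and backing off on errors) take only a few lines on the crawler side. This is a sketch using Python's standard `urllib.robotparser`. The bot name, the sample rules and the `retry_delay` policy are assumptions for illustration, not any real crawler's logic.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (hypothetical rules).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def may_fetch(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True only if the parsed robots.txt allows this user agent."""
    return robots.can_fetch(user_agent, url)


def retry_delay(status: int, attempt: int, base: float = 1.0,
                cap: float = 300.0) -> Optional[float]:
    """Seconds to wait before retrying, or None to give up.

    429 and 5xx are treated as "slow down": exponential backoff, capped.
    Any other error status (426, nginx's non-standard 444, ...) is treated
    as a refusal and is not retried.
    """
    if status == 429 or 500 <= status < 600:
        return min(cap, base * 2 ** attempt)
    return None
```

A production crawler would also add jitter and honor a `Retry-After` header when the server sends one.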
I think you might be confusing search bots with AI bots.