They aren't blocking anything. They are just asking nicely not to be crawled. Given that AI companies haven't cared one bit about ripping off other people's data, I don't see why they would care now.
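To make the "asking nicely" point concrete, here is a minimal sketch, assuming the mechanism in question is a robots.txt file. The robots.txt content, the example.com URL, and the `is_allowed` helper are all made up for illustration; `GPTBot` is only used as an example crawler name. Note that the check runs on the crawler's side, so a crawler that never calls it is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return what robots.txt *requests* for this user agent and URL.

    This is a voluntary check performed by the crawler itself;
    the server does not enforce the answer.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/post/1"  # placeholder URL
    print("GPTBot asked to stay out:", not is_allowed("GPTBot", url))
    print("Other agents allowed:", is_allowed("Mozilla/5.0", url))
```

A polite crawler would call something like `is_allowed` before each fetch and skip disallowed URLs. An impolite one just skips the call.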
LPT: switch to the audio captcha. Yes, it takes a bit longer than if you did one grid captcha perfectly, but I never have to sit there and wonder if a square really has a crosswalk or not, and I never wind up doing more than one.
In their attempt to block OpenAI, they block me. Many sites that were accessible just 2 years ago now require login/captchas/rectal exam just to read the content.
They block plenty and they do it crudely. I get suspicious traffic bans from reddit all the time. It's trivial enough to route around by switching user agent, however, which goes to show that any crawling bot writer worth their salt already routes around the BS of reddit and most other sites by now. I'm just the one getting the occasional headache because I use firefox and block ads and site tracking, I guess.
Yeah, probably right. If you want a great rabbit hole, look up "Common Crawl" and see how a great academic project was absolutely hijacked for pennies on the dollar to grab training data - the foundation for every LLM out there right now.
It was meant to be an open-source compilation of the crawled internet so that research could be done on web search, given how opaque Google's process is. It was NOT meant to be a cheap source of data for for-profit LLMs to train on.
(Shrug) Multiple not-for-profit LLMs have trained on it as well.
If something I worked on turned out to play a significant part in something that turned out to be that big a deal, I'd be OK with it. And nobody's stopping people from doing web-search studies with it, to this day.