by everforward
803 days ago
You mean human users? That is and always will be the dominant group of clients that ignore robots.txt. What you’re talking about is an arms race wherein bots try to mimic human users and sites try to ban the bots without also banning all their human users. That’s not a fight you want to pick when one of the bot authors also owns the browser that 63% of your users use, and the dominant site analytics platform. They have terabytes of data to use to train a crawler to act like a human, and they can change Chrome to make normal users act like their crawler (or their crawler act more like a Chrome user). Shit, if Google wanted, they could probably get their scrapes directly from Chrome and get rid of the scraper entirely. It wouldn’t be without consequence, but they could.
The point here is to poison the well for freeloaders like OpenAI, not to actually prevent web crawlers. OpenAI will actually pay for access to good training data, so don’t hand it over for free.
People don’t mindlessly click on things like terms of service; crawlers are quite dumb. There’s little need for an arms race, as the people running these crawlers rarely put much effort into any one source.