by Retric
803 days ago
It’s fairly trivial to treat Google’s crawler differently if you want. https://developers.google.com/search/docs/crawling-indexing/...
The point here is to poison the well for freeloaders like OpenAI, not to actually prevent web crawlers. OpenAI will actually pay for access to good training data, so don’t hand it over for free.
People don’t mindlessly click on things like terms of service; crawlers are quite dumb. There’s little need for an arms race, as the people running these crawlers rarely put much effort into any one source.
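As a rough illustration of "treating Google's crawler differently", here is a minimal sketch of the reverse-then-forward DNS check commonly used to confirm that a request claiming to be Googlebot really comes from Google. The function name `is_verified_googlebot`, the injectable `reverse_lookup` / `forward_lookup` parameters, and the suffix list are assumptions made for this example; check the suffixes against Google's current documentation before relying on them.

```python
import socket

# Assumed hostname suffixes for Google crawlers; verify against Google's docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` reverse-resolves to a Google hostname that
    forward-resolves back to the same IP.

    The lookup functions are injectable so the logic can be exercised
    without network access. The default forward lookup is IPv4-only.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
    except OSError:  # no PTR record or lookup failure
        return False

    # The PTR record alone can be forged, so only trust known suffixes...
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    # ...and confirm the hostname maps back to the original address.
    try:
        return ip in forward_lookup(host)
    except OSError:
        return False
```

A site could serve clean pages when this returns True and poisoned or rate-limited pages to everything else that merely sends a Googlebot user agent. DNS lookups are slow, so in practice the result would need caching per IP.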
> The point here is to poison the well for freeloaders like OpenAI, not to actually prevent web crawlers. OpenAI will actually pay for access to good training data, so don’t hand it over for free.
Sure, and they’ll pay the scrapers you haven’t banned for your content, because it costs those scrapers $0 to get a copy of your stuff, so they can sell it for far less than you can.
> People don’t mindlessly click on things like terms of service; crawlers are quite dumb. There’s little need for an arms race, as the people running these crawlers rarely put much effort into any one source.
The bots are currently dumb _because_ we don’t try to stop them. So far there has been no need for smarter scrapers.
Watch how quickly that changes if people start blocking bots enough that scraped content has millions of dollars of value.
At the scale of a company, it would be trivial to buy request log dumps from one of the adtech vendors and replay them so you are legitimately mimicking a real user.
Even if you are catching them, you also have to be doing it fast enough that they’re not getting data. If you catch them on the 1,000th request, they’re getting enough data that it’s worthwhile for them to just rotate AWS IPs when you catch them.
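To put rough numbers on the detection-speed point, here is a back-of-the-envelope sketch. The function name `ips_needed`, the 1,000,000-page site size, and the assumption that every request before the detecting one returns a useful page are all hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def ips_needed(total_pages, detect_on_request):
    """IPs a scraper must burn to fetch `total_pages` pages if each IP is
    blocked on its `detect_on_request`-th request (earlier ones succeed)."""
    pages_per_ip = detect_on_request - 1
    if pages_per_ip <= 0:
        raise ValueError("scraper is blocked before getting any page")
    return math.ceil(total_pages / pages_per_ip)


# Hypothetical site with one million pages.
print(ips_needed(1_000_000, 1000))  # 1002 IPs if caught on the 1,000th request
print(ips_needed(1_000_000, 10))    # 111112 IPs if caught on the 10th request
```

Under these assumptions, catching a scraper on the 1,000th request costs it about a thousand IP rotations for the whole site, while catching it on the 10th costs it over a hundred thousand.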
Worst case, they just offer to pay users directly. “Install this addon. It will give you a list of URLs you can click to send their contents to us. We’ll pay you $5 for every thousand you click on.” There’s a virtually unlimited supply of college students willing to do dumb tasks for beer money.
You can’t price segment a product that you give away to one segment. The segment you’re trying to upcharge will just get it for cheap from someone you gave it to for free. You will always be the most expensive supplier of your own content, because everyone else has a marginal cost of $0.