
by everforward 803 days ago

You mean human users? That is and always will be the dominant group of clients that ignore robots.txt.

What you’re talking about is an arms race wherein bots try to mimic human users and sites try to ban the bots without also banning all their human users.

That’s not a fight you want to pick when one of the bot authors also owns the browser that 63% of your users use, and the dominant site analytics platform. They have terabytes of data to use to train a crawler to act like a human, and they can change Chrome to make normal users act like their crawler (or their crawler act more like a Chrome user).

Shit, if Google wanted, they could probably get their scrapes directly from Chrome and get rid of the scraper entirely. It wouldn’t be without consequence, but they could.


Retric 803 days ago

It’s fairly trivial to treat Google’s crawler differently if you want. https://developers.google.com/search/docs/crawling-indexing/...
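For anyone wondering what "treating a crawler differently" might look like in practice, here is a minimal sketch. It assumes the usual reverse-then-forward DNS check: look up the hostname for the client IP, confirm it ends in a suffix the crawler operator publishes, then resolve that hostname and confirm it maps back to the same IP. The function name, the suffix list, and the injectable `reverse`/`forward` resolvers are illustrative assumptions, not something taken from the linked page; check the operator's current documentation for the real suffixes.

```python
import socket

# Assumed suffixes for illustration only; verify against the operator's docs.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, reverse=None, forward=None, suffixes=TRUSTED_SUFFIXES):
    """Return True if `ip` passes a reverse-then-forward DNS check.

    `reverse` maps an IP to a hostname and `forward` maps a hostname to a
    list of IPs. Both default to real DNS lookups and can be swapped out
    for testing (hypothetical hooks, not part of any library).
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)
        # A spoofed User-Agent fails here: the PTR record is not in a trusted domain.
        if not host.lower().endswith(suffixes):
            return False
        # A forged PTR record fails here: the hostname does not resolve back to the IP.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses: lookup failed.
        return False
```

A site could branch on the result: serve verified crawlers the real page and send everything else that claims to be that crawler down whatever path it likes.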

The point here is to poison the well for freeloaders like OpenAI, not to actually prevent web crawlers. OpenAI will actually pay for access to good training data, so don’t hand it over for free.

People don’t mindlessly click on things like terms of service; crawlers are quite dumb. Little need for an arms race, as the people running these crawlers rarely put much effort into any one source.


everforward 803 days ago

It’s trivial to treat it differently, but doing so runs the risk of being accused of cloaking and getting banned from Google’s index: https://developers.google.com/search/docs/essentials/spam-po...

> The point here is to poison the well for freeloaders like OpenAI, not to actually prevent web crawlers. OpenAI will actually pay for access to good training data, so don’t hand it over for free.

Sure, and they’ll pay the scrapers you haven’t banned for your content, because it costs those scrapers $0 to get a copy of your stuff so they can sell it for far less than you.

> People don’t mindlessly click on things like terms of service; crawlers are quite dumb. Little need for an arms race, as the people running these crawlers rarely put much effort into any one source.

The bots are currently dumb _because_ we don’t try to stop them. There’s no need for smarter scrapers.

Watch how quickly that changes if people start blocking bots enough that scraped content has millions of dollars of value.

At the scale of a company, it would be trivial to buy request log dumps from one of the adtech vendors and replay them so you are legitimately mimicking a real user.

Even if you are catching them, you also have to be doing it fast enough that they’re not getting data. If you catch them on the 1,000th request, they’re getting enough data that it’s worthwhile for them to just rotate AWS IPs when you catch them.

Worst case, they just offer to pay users directly. “Install this addon. It will give you a list of URLs you can click to send their contents to us. We’ll pay you $5 for every thousand you click on.” There’s a virtually unlimited supply of college students willing to do dumb tasks for beer money.
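To put rough numbers on the two fallbacks above, here is a back-of-the-envelope sketch. The 1,000-request detection point and the $5 per thousand clicks come from the comment; the one-million-page site and both function names are made up for illustration.

```python
import math


def ips_needed(total_pages: int, requests_before_ban: int) -> int:
    """IPs a scraper must burn if each one is banned after a fixed number of requests."""
    return math.ceil(total_pages / requests_before_ban)


def paid_click_cost(total_pages: int, dollars_per_thousand: float = 5.0) -> float:
    """Cost of paying users per thousand URLs clicked, one click per page."""
    return total_pages / 1000 * dollars_per_thousand


# Hypothetical site with one million pages.
pages = 1_000_000
rotations = ips_needed(pages, requests_before_ban=1_000)
payout = paid_click_cost(pages)
print(f"{rotations} IP rotations, or ${payout:,.0f} paid to clickers")
```

Under those assumptions the whole site costs either a thousand IP rotations or a few thousand dollars, which is the scale of expense the argument is about.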

You can’t price segment a product that you give away to one segment. The segment you’re trying to upcharge will just get it for cheap from someone you gave it to for free. You will always be the most expensive supplier of your own content, because everyone else has a marginal cost of $0.


Retric 803 days ago

Google doesn’t care what you do to other crawlers that ignore your TOS. This isn’t a theoretical situation; it’s already going on. Crawling is easy enough to “block”; there are court cases on this stuff because this is very much a case where the defense wins once it devotes fairly trivial resources to the effort.

And again, blocking should never be the goal; poisoning the well is. For whoever is training AI, poisoned data is both harder to detect than a block and vastly more harmful. A price comparison tool is only as good as the actual prices it can compare, etc.
