	by cjf101 626 days ago
	There was a bunch of reporting on how AI companies and researchers were using tools that ignored robots.txt. It's a "polite request" that these companies had a strong incentive to ignore, so they did. That incentive is still there, so it is likely that some of them will continue to do so.

Ukv 626 days ago

CommonCrawl[0] and the companies training models I'm aware of[1][2][3] all respect robots.txt for their crawling.

If we're thinking of the same reporting, it was based on a claim by TollBit (a content licensing startup), which was in turn based on the fact that "Perplexity had a feature where a user could prompt a specific URL within the answer engine to summarize it". Tools acting as a user agent on someone's behalf (like archive.today, a webpage-to-PDF site, or a translation site) aren't crawlers and aren't what robots.txt is designed for, but either way the feature is disabled now.

[0]: https://commoncrawl.org/faq

[1]: https://platform.openai.com/docs/bots

[2]: https://support.anthropic.com/en/articles/8896518-does-anthr...

[3]: https://blog.google/technology/ai/an-update-on-web-publisher...

cjf101 626 days ago

These policies are much clearer than they were when last I looked, which is good. On the other hand, Perplexity appeared to ignore robots.txt as part of a search-enhanced retrieval scheme, at least as recently as June of this year. The article title is pretty unkind, but the test they used pretty clearly shows what was going on.

https://www.wired.com/story/perplexity-is-a-bullshit-machine...

It takes this sort of critical scrutiny; otherwise mechanisms like robots.txt do get ignored, whether willfully or mistakenly.

Ukv 624 days ago

> The article title is pretty unkind, but the test they used pretty clearly shows what was going on.

I believe this article rests on the same misunderstanding - it doesn't appear to show any evidence of their crawler, or web scraping used for training, accessing pages prohibited by robots.txt.

FrustratedMonky 626 days ago

Robots.txt is a suggestion. As is reporting on using it.

The companies that are ignoring robots.txt are also probably the companies not advertising that they are ignoring robots.txt.

Ukv 626 days ago

The EU's AI act points to the DSM directive's text and data mining exemption, allowing for commercial data mining so long as machine-readable opt-outs are respected - robots.txt is typically taken as the established standard for this.

In the US it is a suggestion (so long as Fair Use holds up) but all I've seen suggests that the major players are respecting it, and minor players tend to just use CommonCrawl which also does. Definitely possible that some slip through the cracks, but I don't think it's as useless as is being suggested.
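
For anyone who wants to see what "respecting" the opt-out means mechanically, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the example.com URLs are made up for illustration; `GPTBot` is the user agent token OpenAI documents for its crawler, and `SomeBrowser` is a placeholder for any agent that has no dedicated rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt one named crawler out of everything,
# and keep every other agent out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("GPTBot", "https://example.com/article"),
        ("SomeBrowser", "https://example.com/article"),
        ("SomeBrowser", "https://example.com/private/notes"),
    ]:
        print(agent, url, allowed(ROBOTS_TXT, agent, url))
```

A well-behaved crawler runs a check like this before each request and skips anything that comes back False. Nothing stops a crawler from skipping the check, which is the "suggestion" part.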

FrustratedMonky 626 days ago

Technically, robots.txt isn't enforcing anything, so it is just trust.

"OpenAI CTO doesn't know what data was used to train the company's video generating platform, Sora"

https://www.youtube.com/watch?v=4AYbZG3h14w

Funny. If I can browse to it, it is public right? That is how some people's logic goes. And how OpenAI argued 2 years ago when GPT3.5/ChatGPT first started getting traction.

Ukv 626 days ago

> Technically, robots.txt isn't enforcing anything, so it is just trust.

There's legal backing to it in the EU, as mentioned. With CommonCrawl you can just download it yourself to check. In other cases it wouldn't necessarily be as immediately obvious, but through monitoring IPs/behavior in access logs (or even prompting the LLM to see what information it has) it would be possible to catch them out if they were lying - like Perplexity were "caught out" in the mentioned case.
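
A rough sketch of the access-log check, assuming logs in the common "combined" format. The sample log lines, the IPs (taken from documentation ranges), the robots.txt rule, and the `WATCHED_BOTS` list are all made up for illustration. It only catches crawlers that identify themselves in the User-Agent header, so an undeclared or spoofed agent would need IP or behavior analysis instead.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical rule and watchlist; adjust both to your own site.
ROBOTS_TXT = "User-agent: GPTBot\nDisallow: /private/\n"
WATCHED_BOTS = ["GPTBot", "CCBot"]

# Combined log format: ip ident user [time] "METHOD path proto" status size "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def find_violations(robots_txt, log_lines, bots=WATCHED_BOTS):
    """Return (ip, bot, path) for each logged request that robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not in combined format, skip it
        path, agent = match["path"], match["agent"].lower()
        if path == "/robots.txt":
            continue  # crawlers are expected to fetch this file
        for bot in bots:
            # Match the bot token inside the full User-Agent string, then
            # ask the parser about that token rather than the whole string.
            if bot.lower() in agent and not parser.can_fetch(bot, path):
                violations.append((match["ip"], bot, path))
    return violations


# Made-up sample data.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2025:00:00:01 +0000] "GET /private/report HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.10 - - [01/Jan/2025:00:00:02 +0000] "GET /blog/post HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:00:00:03 +0000] "GET /private/report HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for hit in find_violations(ROBOTS_TXT, SAMPLE_LOG):
        print(hit)
```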

> Funny. If I can browse to it, it is public right? That is how some people's logic goes. And how OpenAI argued 2 years ago when GPT3.5/ChatGPT first started getting traction.

If you mean public as in the opposite of private, I think that's pretty much true by definition. Information's no longer private when you're putting it on the public Internet.

If you mean public as in public domain, I don't think that has been argued to be the case. The argument is that it's fair use (that is, the content is still under copyright, but fitting statistical models is substantially transformative/etc.)
