by avallach 316 days ago
Cloudflare did explain a proper solution: "Separate bots for separate activities". E.g. here: one bot for scraping/indexing, and one for non-persistent user-driven retrieval.
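For illustration, here is a minimal sketch of what "separate bots for separate activities" looks like from the site owner's side, using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs and the `allowed` helper are all made up for this example. The two agent names are the ones mentioned elsewhere in this thread, written with a plain ASCII hyphen. This only shows how a compliant client would read the rules; it says nothing about how Perplexity or Cloudflare actually implement anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the indexing crawler everywhere,
# let the user-driven fetcher in except under /private/,
# and leave every other agent unrestricted.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /private/

User-agent: *
Allow: /
"""


def allowed(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("PerplexityBot", "Perplexity-User", "SomeBrowser"):
        for url in ("https://example.com/post", "https://example.com/private/x"):
            print(agent, url, allowed(agent, url))
```

The point is only that the two agents can be addressed independently, so an owner who blocks both has made a deliberate choice about each.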

Website owners have a right to block both if they wish. Isn't it obvious that bypassing a bot block is a violation of the owner's right to decide whom to admit?

Perplexity almost seems to believe that "robots.txt was only made for scraping bots, so if our bot is not scraping, it's fair for us to ignore it and bypass the enforcement". And their core business is a bot, so they really should have known better.


They're already doing that https://docs.perplexity.ai/guides/bots There's PerplexityBot and Perplexity‑User.
And then once they see that the website operator has blocked Perplexity-User, apparently instead of respecting that, they not only ignore robots.txt but actively try to bypass the security measures established with the explicit purpose of limiting their access. If this were about bypassing DRM rather than an AI-WAF, it would be plainly illegal.

To me this invalidates their whole claim that Cloudflare fails to tell the difference between scraper and user-driven agent. Instead, distinguishing them is trivial, and the block is intentional.

I use Perplexity regularly for research because it does a good job accessing, preprocessing and citing relevant resources. Which do you think is better: the service respects my desire for it to do a good job and ignores site owners blocking agent access because they "don't like automated agents", or the service respects said site owners' desires, which I consider unreasonable, and does not do a good job for me? Expand that to the inevitably increasing LLM-for-research user base.
I can totally see your point. It's a bit like that fight of news agencies against the free snippets and aggregations on 3rd party websites. The Internet is supposed to be open after all.

But it also feels like essentially "pirating" the webpages while erasing their brand. Maybe it's even a tolerable transitional situation, but you can't even argue it's beneficial in the same way as game piracy could be according to some. In the long term, we need an incentive for the content creators to willingly allow such processing. Otherwise, a lot of high quality content will eventually become members-only with DRM-like anti-agent protections.

The incentive doesn't have to be monetary. I could for example imagine some website owners allowing AI agents that commit to repeating verbatim, up front, some sort of mandatory headers/messages/acknowledgements from the content authors before copying or summarizing, and that are known to stick to this commitment.

You can also already bypass the problem by accessing and copying the content manually, and then putting it in the context of a tool like NotebookLM. Nobody's hurt, because you have actually seen the source yourself, and that's all the website owners can reasonably demand.

TL;DR: why even post quality content in the open if the audience won't see your ads, your donation button, or even your name? What do you think?

This kind of makes sense for ChatGPT and others. But Perplexity links to your content directly. I end up clicking more Perplexity sources than search results in practice. I don't know how well that generalises, but the traffic is not just going away.
> In the long term, we need an incentive for the content creators to willingly allow such processing. Otherwise, a lot of high quality content will eventually become members-only with DRM-like anti-agent protections.

I partially agree with this. Yes, some incentive is OK, for some cases. I wouldn't be OK with a mandatory header/message for example showing up in my output, unless there's some very direct relevance to the content. But there could be some kind of tipper markup/code embedded in the site metadata that my agent abstracts away as content rating feedback options, and tips automatically made on my behalf if I have it configured and selected the "useful" option. Of course source citation should also be a mandatory part of the output, for that branding and also in case there's desire to go beyond the output.

However, there will also always be content authors out there who share quality content freely with no expectation of any kind of return. The "problem" is that such content usually isn't SEO-optimized, and so likely won't be in the top results. There will be little lost if those optimizing for return start blocking their content as they'll also be automatically deranked, by virtue of content access issues, and the non-optimized content will then rise to the surface.

TL;DR: suggested configurable creator-tipping system abstracted behind feedback options, and the likely case that those who block access will be deranked in favor of those maintaining open access.

> bypassing a bot block is a violation of the owner's right to decide whom to admit?

There is only a violation if the bot finds a way around a login block. Same for a human. But whatever is on the public web is... public. For all.

So it's ok to block someone "because you didn't include a session token I gave you in exchange for knowing the password" but it's not ok to block someone "because you didn't stick to manually-operated user agents as I told you via robots.txt"? What about not letting someone play level 42 "because you didn't complete level 41"?

A web server providing a response to your request is akin to a restaurant server doing the same. Except for specific situations related to civil rights, they are free to not deal with you for any reason.

Typically when something is behind a login, it denotes a private space intended for a particular set of persons given explicit access. It's senseless to block people from using agents if the same people would otherwise have access, unless there is an abuse of that access, i.e. action that is to the detriment of the space. And though some of that does happen, it obviously isn't the full story. I have a Perplexica instance running locally that I sometimes use (but often don't, as Perplexity does a much better job). Should that also be blocked?

Hmm, maybe a civil case could potentially be made here too, re disability. By blocking LLM use, sites are reducing the ability of select users to reasonably interact with the content. It just could become a thing in a few years if this nonsense continues.