by lolinder
698 days ago
The entry here for Perplexity is the one that got a lot of attention but it's also unfair: PerplexityBot is their crawler, which uses that user agent and as far as anyone can tell it respects robots.txt.

They also have a feature that will, if a user pastes a URL into their chat, go fetch the data and do something with it in response to the user's query. This is the feature that made a big kerfuffle on HN a while back when someone noticed it [0].

That second feature is not a web crawler in any meaningful sense of the word "crawler". It looks up exactly one URL that the user asked for and does something with it. It's Perplexity acting as a User Agent in the original sense of the word: a user's agent for accessing and manipulating data on the open web.

If an AI agent manipulating a web page that I ask it to manipulate in the way I ask it to manipulate it is considered abusive then so are ad blockers, reader mode, screen readers, dark reader, and anything else that gives me access to open web content in a form that the author didn't originally intend.

[0] https://news.ycombinator.com/item?id=40690898
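For anyone unsure what "respects robots.txt" means mechanically, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `is_allowed` helper, and the example.com URL are made up for illustration; only the `PerplexityBot` token comes from the discussion above, and this is not a claim about how Perplexity actually implements the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish (an assumption for this
# sketch): block the PerplexityBot crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/some/article"
    print(is_allowed("PerplexityBot", url))  # crawler token matched by the first group
    print(is_allowed("Mozilla/5.0", url))    # falls through to the wildcard group
```

A well-behaved crawler runs a check like this before each request. The check keys only on the user agent string the client chooses to send, which is why a fetch made under a different user agent falls under the wildcard rules instead.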
|
The action is indeed prompted by a human, but so is any crawl in some way: at some point someone configured an interval or other trigger that sends the script to the web host to fetch anything it can find.
It's inherently different from extensions such as ad blockers, which just remove elements according to their configuration.
After all, the user's device never even sees the final DOM here. Instead, the page is fetched, parsed and processed on a third device, which is objectively a robot. You could only make that argument if it were implemented as an extension (the user's device fetches the page and posts the final document to the LLM for processing).
And that's ignoring the fact that ad blockers are seen as illegitimate by a lot of websites too, which often try to block access for people using these extensions.