by jsheard
698 days ago
> I'm glad that most of them seem to respect robots.txt.
https://github.com/ai-robots-txt/ai.robots.txt/blob/main/tab...
Some of them identify themselves by user agent but don't respect robots.txt, so you have to set up your server to 403 their requests to keep them out. If they start obfuscating their user agents, though, there won't be an easy solution besides deferring to a platform like Cloudflare, which offers to play that cat-and-mouse game on your behalf.
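For anyone wondering what "403 their requests" might look like in practice, here's a rough sketch as a plain WSGI middleware in Python. The blocklist tokens (`ExampleBot`, `AnotherCrawler`) and the names `is_blocked`, `block_agents`, and `hello_app` are made up for illustration; swap in whatever user-agent strings you actually want to refuse. It only works as long as the bot sends an honest User-Agent header.

```python
# Hypothetical blocklist: placeholder tokens, not real crawler names.
BLOCKED_AGENT_TOKENS = ("ExampleBot", "AnotherCrawler")


def is_blocked(user_agent, tokens=BLOCKED_AGENT_TOKENS):
    """Return True if any blocklisted token appears in the User-Agent (case-insensitive)."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in tokens)


def block_agents(app, tokens=BLOCKED_AGENT_TOKENS):
    """Wrap a WSGI app so that requests from blocklisted user agents get a 403."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", ""), tokens):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else is passed through to the wrapped application.
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Stand-in application used only to demonstrate the middleware."""
    body = b"hello\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


application = block_agents(hello_app)
```

In a real deployment you'd more likely do the same match in your web server or reverse proxy config rather than in application code, but the logic is the same: match on the header, short-circuit with a 403.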
|
They also have a feature that will, if a user pastes a URL into their chat, go fetch the data and do something with it in response to the user's query. This is the feature that caused a big kerfuffle on HN a while back when someone noticed it [0].
That second feature is not a web crawler in any meaningful sense of the word "crawler". It looks up exactly one URL that the user asked for and does something with it. It's Perplexity acting as a User Agent in the original sense of the word: a user's agent for accessing and manipulating data on the open web.
If an AI agent manipulating a web page that I ask it to manipulate, in the way I ask it to manipulate it, is considered abusive, then so are ad blockers, reader mode, screen readers, Dark Reader, and anything else that gives me access to open web content in a form that the author didn't originally intend.
[0] https://news.ycombinator.com/item?id=40690898