by etler 770 days ago

It was dumb of Perplexity to specifically say that they follow robots.txt and then immediately not respect it. I doubt it was a consciously malicious decision, as it wouldn't make any sense to try to earn goodwill from that commitment and then expect nobody to notice that you weren't actually honoring it. I expect that they simply forgot they even made that promise. That's still a problem though. Too many companies sloppily make pledges because they think it will make them look good, only to stop caring about the pledge right after they make it.

While it was definitely wrong of them to commit to it and renege, I don't really think robots.txt should be expected to be respected for a live web scraping use case. I think performing a web request immediately after a user requests it is much more similar to a proxy service than it is to an indexing service like a search engine. Should archive.is also be expected to respect robots.txt, even though it's directly performing a scrape on the user's behalf? How is live scraping materially different from a proxy or a CDN? robots.txt was created in the context of crawlers scraping content on a batch schedule, not fetching it in real time on a user's request.
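For anyone unsure what "respecting robots.txt" amounts to mechanically, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` and `SomeBrowser` user agent names, and the `is_allowed` helper are all made up for illustration; this is not how Perplexity or any other service actually implements it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named bot entirely,
# and blocks /private/ for everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    print(is_allowed(ROBOTS_TXT, "ExampleBot", url))
    print(is_allowed(ROBOTS_TXT, "SomeBrowser/1.0", url))
```

The check is entirely voluntary and keyed on whatever user agent string the client chooses to announce, which is part of why the line between a crawler and a fetch made on a user's behalf is hard to draw.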

Live request services could even leverage the end user's machine to perform the requests directly and then process the data for them, instead of doing the scrape on their behalf. So I feel the party directly responsible for data being pulled off a site is more the user than the service that's acting as a proxy.

The current media order is definitely at risk, and we need to find ways to protect it so that reporting doesn't die in this paradigm shift. Trying to push against the entirety of AI progress is not going to work, and this is really just screaming into the void over technicalities. Even if live-scrape-powered AI services were banned, the same service would just move to the end user, with apps that perform the requests directly on the user's device. I don't think anyone outside of the industry cares about the technical nuances of how AI services visit websites for the user. There's a bigger problem at play here regarding the future of high-quality media, and it needs to be addressed directly.