| HN Mirror



	by Cynddl 691 days ago
	I find it interesting that as an (edit: UK) academic researcher, I would likely be forbidden to use tools like this, which fail basic ethics standards, regulations such as GDPR, and practical standards such as respecting robots.txt [given there's no information on embedding.io, it's unlikely I can block the crawler when designing a website]. There's still room for ethical development of such crawlers and technologies, but it needs to be consent-first, with strong ethical and legal standards. The crazy development of such tools has been a massive issue for a number of small online organisations that struggle with poorly implemented or maintained bots (as discussed for OpenStreetMap or Read The Docs).

1 comments

popcorncowboy 690 days ago

I'm less convinced. Are you saying it's unethical to automate browsing a site?

Because if you save the pages you browse on some site, they're yours (authors don't own your cache).

Perhaps you're arguing that if you wrote a lightweight script/browser (which is just your user agent) to save some website for offline use, that'd be unethical and a GDPR violation? Again, I don't think so, but maybe I'm missing something. Perhaps this turns on what defines a "user agent".

Perhaps this becomes a "depth of pre-fetch" question. If your browser prefetches linked pages, that's "automated" downloading, akin to the script approach above. Downloading. To your cache. Which you own. (Where I struggle to see an ethical violation)

Genuinely curious where the line is, or what exactly here is triggering ethics, GDPR and practical standards?

Cynddl 690 days ago

Maybe a good illustration would be Clearview AI. They are scraping websites, extracting information (images), and training ML models to learn embeddings (distance between faces). They indiscriminately collect personal data without opt-in, offering only a limited opt-out mechanism.

In this case, if this tool is used to scrape a website, there are two direct issues: 1/ no immediate way for the website owner to exclude this particular scraper (what is the user agent?) 2/ no way for data subjects (whose data is present on the website) to search whether the scraper learned their personal data in the embeddings. Data being available publicly doesn't mean it can be widely used [at least outside the US, where we have much stricter rules on scraping].
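
To make point 1/ concrete, here is a rough sketch of what excluding a scraper looks like when its user agent is published. It uses Python's standard `urllib.robotparser`. The crawler name `ExampleEmbedBot` and the `example.org` URLs are made up for illustration; I don't know what user agent this tool actually sends, which is exactly the problem.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish. "ExampleEmbedBot" is an
# assumed crawler name, not the real user agent of any tool discussed here.
ROBOTS_TXT = """\
User-agent: ExampleEmbedBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleEmbedBot", "SomeOtherBot"):
        for url in ("https://example.org/docs/", "https://example.org/private/x"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

A well-behaved crawler would run a check like this before every fetch. Without a documented user agent, the site owner has nothing to put on the `User-agent:` line short of blocking every bot.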