Hacker News
by Ian_Kerins 1614 days ago
100% agree: scraping should always be done respectfully.

- If they provide an API, then use it.

- Don't slam a website. Ideally, spread your requests out over the hours of the day when their target audience is least active (night time).

- If you can get cached data from somewhere that works, then use that.
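
As a rough sketch of the last two points (space requests out, and reuse cached data), something like the following could work. The names `PoliteFetcher`, `is_off_peak`, the 5 second delay and the 01:00-06:00 quiet window are all assumptions for illustration, not anything a particular site or library prescribes. The actual HTTP call is passed in as `fetch`, so any client can be plugged in.

```python
import time
from typing import Callable, Dict, Optional


def is_off_peak(hour: int, start: int = 1, end: int = 6) -> bool:
    """True if `hour` (0-23, in the site's local time) is inside the
    assumed quiet window [start, end). The window is a guess; adjust per site."""
    return start <= hour < end


class PoliteFetcher:
    """Wraps a fetch function with a cache and a minimum delay between requests."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        min_delay: float = 5.0,  # assumed spacing in seconds
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.fetch = fetch
        self.min_delay = min_delay
        self.clock = clock
        self.sleep = sleep
        self.cache: Dict[str, str] = {}
        self.last_request: Optional[float] = None

    def get(self, url: str) -> str:
        # Cached data first: no request reaches the site at all.
        if url in self.cache:
            return self.cache[url]
        # Space requests out so the site is never hit in a burst.
        if self.last_request is not None:
            wait = self.min_delay - (self.clock() - self.last_request)
            if wait > 0:
                self.sleep(wait)
        body = self.fetch(url)
        self.last_request = self.clock()
        self.cache[url] = body
        return body
```

A scheduler would then only call `get` while `is_off_peak(...)` is true for the site's local hour. The cache here is in-memory only; a real job would probably persist it and also honor `robots.txt`.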

Most developers are respectful and only scrape what they really need, not only from an ethical point of view but also a cost and resources point of view. Scraping data is resource intensive and proxy costs can quickly rise to $1,000-$10,000 per month. So most only scrape the minimum they need.

The other thing here is that a lot of the most popular sites being scraped are also massive scrapers themselves. The big ecommerce sites are being scraped, but they are also scraping their competitors.


Don’t get my home address, name, family members’ names, salary, and cell phone number, aggregate and sell them, and claim “it’s all publicly available anyway”.
If you post that data in a public place, it is publicly available. It's like writing that info on a piece of cardboard, putting it in the town square, and then saying 'why you people steal my data!'
I disagree, because there is a difference between posting something publicly for humans and posting something publicly for bots/large scale analysis. I'm ok with my employer possibly being able to see whether I am looking for a new job or not on LinkedIn if that means they would need to have a human looking at my LinkedIn page. I am not ok with them training some ML algorithm to monitor my LinkedIn page at all times to determine how likely I am to leave the company.

Another danger is when public but not easily accessible data can be used to deanonymize datasets, which is probably the norm rather than the exception for anonymized datasets. Sure, there are technical measures to make it better, but at the end of the day I think a lot of privacy is about respecting social boundaries and not breaking these protection measures even if it is technically possible. Most of the time, these measures are really about keeping honest people honest and not about stopping dedicated attackers.
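
To make the deanonymization point concrete, here is a minimal sketch of a linkage attack, assuming an "anonymized" dataset that still carries a few quasi-identifiers (ZIP code, birth year, gender) and a public list that carries the same fields plus a name. The field names, the `reidentify` function and all records are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical quasi-identifiers shared by both datasets.
QUASI_IDS = ("zip", "birth_year", "gender")


def reidentify(anonymized: List[dict], public: List[dict]) -> Dict[int, str]:
    """Map row index in `anonymized` to a name when exactly one public
    record shares the same quasi-identifier values."""
    index = defaultdict(list)
    for person in public:
        key = tuple(person[f] for f in QUASI_IDS)
        index[key].append(person["name"])

    matches: Dict[int, str] = {}
    for i, row in enumerate(anonymized):
        key = tuple(row[f] for f in QUASI_IDS)
        names = index.get(key, [])
        if len(names) == 1:  # a unique match re-identifies the row
            matches[i] = names[0]
    return matches


# Made-up example data: no names in the "anonymized" rows.
anonymized_rows = [
    {"zip": "02139", "birth_year": 1985, "gender": "F", "salary": 91000},
    {"zip": "94110", "birth_year": 1990, "gender": "M", "salary": 78000},
]
public_rows = [
    {"name": "Alice Example", "zip": "02139", "birth_year": 1985, "gender": "F"},
    {"name": "Bob Example", "zip": "94110", "birth_year": 1990, "gender": "M"},
    {"name": "Carl Example", "zip": "94110", "birth_year": 1990, "gender": "M"},
]
```

With this data, `reidentify(anonymized_rows, public_rows)` links the first salary to a single name, while the second row stays ambiguous because two public records share its quasi-identifiers.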

I have quite conscientiously never posted most of that information publicly, and yet it is for sale.