by stingraycharles 1489 days ago
There is no clear “right” or “wrong” side here, although not honoring robots.txt is somewhat frowned upon. It's comparable to being told certain areas are off-limits and deciding to go there anyway.

Having said that, if you’re in the web scraping business, dealing with anti-scrape shields and whatnot, ignoring robots.txt is the least nefarious thing you'll be doing.

1 comment

My personal point of view (my opinion, not my company's) is that it's unethical to take advantage of all the benefits public data provides, which, let's be honest, are absolutely massive - like search engine indexing, brand building, content previews etc. - while at the same time avoiding the costs of having data public, like someone automating their browsing experience.

I mean, we've had a solution to web scrapers since the inception of web authentication, but the value of having the data public clearly outweighs the costs of having your data scraped, to the point where big corporations would rather take web scrapers all the way to the Ninth Circuit (the LinkedIn case) than shut down public access.

That being said, our understanding of information philosophy is still in complete infancy, so it's hard to discuss ethics here. Generally, I'm in favor of hackers, individuals and decentralization over big corporations, and access to web scraping empowers the former and weakens the latter - so I'm rooting for the healthier, better version of the internet above all!

You make fair points about the negative side of the web in a non-ideal world.

But please stop framing your encouragement of toxic crawling practices as some sort of noble pursuit in a made-up fight against The Man.

Just own it as the "I'm-alright-Jack" approach it is; the honesty will make it a more respectable position intellectually, even if it remains unethical.