by karterk 1489 days ago
Do you follow robots.txt or do you allow your customers to bypass restrictions placed by the website? How do you feel about bypassing that? I know it's not illegal but it's certainly not ethical either.
2 comments

Could you expand on why you think it's ethically required to follow robots.txt instructions?

The primary argument _in favor_ of automation (e.g. web scraping) is that it would be unethical to hire hundreds of people to do menial, unfulfilling tasks like mindlessly clicking around a website and saving pages when that could be done by a program, which is countless times more efficient for everyone involved (the website included) and safer.

The ethical argument has nothing to do with hiring hundreds of people to circumvent robots.txt requirements!

The whole point is to avoid unnecessary or excessive crawling by bots that are engineered with no concern for anything other than the owner's motivations, presumably financial in most cases.

Sure. At earlier jobs, I often got paged because of some bot running wild in crawling our web pages. Some pages are heavy and we don't expect them to be hit often, but once you crawl these pages indiscriminately (even if accidentally) that can bring a site down. There are also some pages whose underlying resources are billed in a pay-as-you-use model. Once again, heavy bot traffic ran up our bills.

Robots.txt allows site owners to restrict such pages from being crawled by bots. Services that allow people to circumvent the restrictions are being rude, to say the least. Many crawling services also use a farm of proxies that hide their real identity behind fake user agents to circumvent rate limiting etc. All of these "strategies" go far beyond basic automation and are quite shady in reality.

There's actually a difference between crawling and web scraping. Crawling discovers pages in a very loose manner by following all links, digesting them and producing more crawl tasks. Web scraping, on the other hand, is a more controlled environment where the rules are pretty strict, e.g. scrape `product-<product id>.html` links for product data, so web scrapers are very unlikely to stumble on some page randomly.
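To make the distinction concrete, here is a minimal sketch of the "strict rules" side: out of all links found on a page, a scraper keeps only those matching one pattern, where a crawler would queue every one of them. The `product-<id>.html` path shape, the numeric id, the `example.com` URLs and the function name `select_product_urls` are all assumptions made up for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical URL shape: /product-<numeric id>.html
PRODUCT_PATH = re.compile(r"^/product-(\d+)\.html$")


def select_product_urls(links):
    """Return {product_id: url} for links that match the product pattern.

    Everything else (cart pages, search pages, heavy reports) is ignored,
    which is why a scoped scraper rarely wanders onto unexpected pages.
    """
    selected = {}
    for link in links:
        match = PRODUCT_PATH.match(urlparse(link).path)
        if match:
            selected[match.group(1)] = link
    return selected


links_on_page = [
    "https://example.com/product-42.html",
    "https://example.com/cart",
    "https://example.com/reports/expensive-export",
]
product_urls = select_product_urls(links_on_page)
```

A crawler given the same `links_on_page` would enqueue all three URLs, including the expensive one.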

Also, unfortunately, robots.txt is rarely used to indicate non-crawlable endpoints these days; instead it is used as a way to withhold public data. Just take any random big website and take a look at its robots.txt file:

    User-agent: Googlebot
    Allow: /

    User-agent: *
    Disallow: /
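For anyone who wants to see how a well-behaved client would read those rules, here is a small sketch using Python's standard `urllib.robotparser`. The `example.com` URL, the `my-scraper` user agent and the helper name `allowed` are placeholders, not anything from a real site.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, one directive per line.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Googlebot gets the whole site; every other agent falls under "*".
googlebot_ok = allowed("Googlebot", "https://example.com/products")
scraper_ok = allowed("my-scraper", "https://example.com/products")
```

With these rules `googlebot_ok` is true and `scraper_ok` is false, which is exactly the "public for Google, closed for everyone else" pattern being described.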

There is no clear “right” or “wrong” side here, although not honoring robots.txt is a bit frowned upon, comparable to someone telling you certain areas are off-limits, but you decide to go there anyway.

Having said that, if you’re in the web scraping business dealing with anti-scrape shields and whatnot, ignoring robots.txt is the least nefarious of them all.

My personal point of view (my opinion, not my company's) is that it's unethical to take advantage of all the benefits public data provides, which let's be honest are absolutely massive (search engine indexing, brand building, content previews etc.), while at the same time avoiding the costs of having data public, like someone automating their browsing experience.

I mean, we've had a solution to web scrapers since the inception of web authentication, but the value of having the data public clearly outweighs the costs of having it scraped, to the point where big corporations would rather take web scrapers all the way to the Ninth Circuit (the LinkedIn case) than shut down public access.

That being said, our understanding of information philosophy is still in complete infancy, so it's hard to discuss ethics here. Generally, I'm in favor of hackers, individuals and decentralization over big corporations, and access to web scraping empowers the former and weakens the latter. So I'm rooting for the healthier, better version of the internet above all!

You make fair points about the negative side of the web in an unideal world.

But please stop framing your encouragement of toxic crawling practices as some sort of noble pursuit in a made-up fight against The Man.

Just own it as the "I'm-alright-Jack" approach it is; the honesty will make it a more respectable position intellectually, even if it remains unethical.

The robots.txt file at times makes it easier to scrape a target, thanks to the info a website can reveal in it (e.g. allowing a specific bot to scrape everything). Their sitemaps are another gem.
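As a rough illustration of the sitemap point, Python's standard `urllib.robotparser` (3.8+) can pull `Sitemap:` entries straight out of a robots.txt body. The sample file, the `example.com` URL and the helper name `list_sitemaps` are invented for the sketch.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt that advertises a sitemap.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""


def list_sitemaps(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body (maybe empty)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


sitemaps = list_sitemaps(SAMPLE_ROBOTS_TXT)
```

Each URL in `sitemaps` points at an index of pages the site itself chose to publish, which is often a gentler starting point than following every link.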