by artilect 3304 days ago
This is the reason robots.txt was created: to tell web scrapers, and the people building them, what is off limits.

Of course there are people building services that scrape certain sites that appear to be off limits to you.

Those people scraping sites that are explicitly prohibited either:

a. are breaking the rules, and potentially the law if scraping is explicitly prohibited in a ToS, and will eventually have to deal with getting banned or sued. It's quite a gray area legally, but here are some laws that could be used against you:

Violation of the Computer Fraud and Abuse Act (CFAA); violation of California Penal Code; violation of the Digital Millennium Copyright Act (DMCA); breach of contract; trespass; misappropriation. Source: LinkedIn v. Doe Defendants

b. have an agreement with the website owners allowing them to scrape certain portions of their site.

c. are scraping data that has no rules concerning it.

For example, Facebook has a ToS for scraping: https://www.facebook.com/apps/site_scraping_tos_terms.php At the bottom there is a form for those who want permission to scrape the site. Their robots.txt also makes heavy use of per-User-Agent rules to control the crawlers they know about: http://facebook.com/robots.txt
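
If you want your scraper to honor those per-User-Agent rules, here is a rough sketch using Python's standard urllib.robotparser. The robots.txt content, the example.com URLs, the "MyScraper" agent name, and the is_allowed helper are all made up for illustration; this is not Facebook's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch, not any
# real site's file): one known crawler gets partial access, everyone else
# is shut out entirely.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("Googlebot", "https://example.com/public/page"),
        ("Googlebot", "https://example.com/private/data"),
        ("MyScraper", "https://example.com/public/page"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

Against a live site you would point the parser at the real file with set_url() and read() instead of parsing a string. Keep in mind this only tells you what the site asks for; it says nothing about what the ToS or the law allows.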

It's rare you would run into legal issues, but possible. The question is whether it's morally okay for you to scrape any data you want.