| HN Mirror


	by reefoctopus 2553 days ago
	Teach your students to ensure there’s a delay between requests so they aren’t hammering anyone’s server, and follow the rules in the robots.txt. I’ve scraped more than a billion pages without any issues.
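
	A rough Python sketch of the two habits above (a pause between requests, and checking robots.txt before fetching) might look like this. It uses only the standard library's urllib.robotparser; the class name PoliteScraper, the "example-bot" user agent, and the 1-second default delay are assumptions made up for illustration, not anything from the original comment.

```python
import time
from urllib import robotparser


class PoliteScraper:
    """Checks robots.txt rules and enforces a minimum delay between requests.

    Hypothetical helper for illustration only.
    """

    def __init__(self, robots_lines, user_agent="example-bot",
                 default_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        self.rules = robotparser.RobotFileParser()
        # Parse robots.txt content that was already downloaded (a list of lines).
        self.rules.parse(robots_lines)
        # Use the site's Crawl-delay if it declares one, else our own default.
        declared = self.rules.crawl_delay(user_agent)
        self.delay = float(declared) if declared is not None else default_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def allowed(self, url):
        """Return True if robots.txt permits this user agent to fetch the URL."""
        return self.rules.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Block until at least self.delay seconds have passed since the last call."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.delay - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now


robots_txt = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]
scraper = PoliteScraper(robots_txt)
```

	Before each request, a scraper would call scraper.allowed(url) and skip the URL if it returns False, then call scraper.wait_turn() right before fetching. The clock and sleep parameters exist only so the timing logic can be tested without real waiting.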

3 comments

avip 2553 days ago

He asked about legality, not technical difficulty.

malshe 2553 days ago

Actually, many of the students are technically competent enough to do the scraping, mostly using Python, and I am pretty sure they have learned not to overwhelm web servers.

gshdg 2553 days ago

Just because it’s technically feasible does not mean it’s legal or ethical.

dheera 2553 days ago

I'd say the act of web scraping alone is almost never unethical if you are careful not to cause undue load on servers. From an ethical, not legal, perspective, I don't see a whole lot of difference between your computer's silicon eyes and your organic eyes just looking at something that's already in plain view.

It might be illegal in some jurisdictions; IANAL, but I think you can just get out of that jurisdiction and scrape away if that is the case. It might violate some ToS, but a ToS isn't law; the consequences of violating a ToS are usually on the order of getting your IP banned.

What you do with the stuff you scraped can be ethical or unethical.

reefoctopus 2553 days ago

What makes it unethical?

Why should I be treated differently than search engine spiders?

If somebody doesn’t want their site scraped then they can let people know with robots.txt. Get off your high horse.

patrickmcnamara 2553 days ago

They never said it was unethical.

matz1 2553 days ago

Likewise, just because it's not legal, or from some perspective unethical, doesn't mean one should not do it.
