by throwaway6845 3299 days ago
I built a simple CRUD app for a previous (small) employer. Nothing special technology-wise, but a good concept, sound business model, and backed up with a couple of full-time staff creating content for it. Line one of the T&Cs was "no scraping". Business model was based on sales to individual users but we were prepared to do analysis in aggregate if asked.

A scraper company, funded by magic money (Knight Foundation grants) and $1m of VC, convinced a (UK) Government department to pay them to scrape our site for some analysis the department wanted. They'd never contacted us, never asked for permission, never asked if we could supply the data. Our company was bumping along at this point and having to lay people off. Income from a nice lucrative Government contract would have kept a couple more people in work.

The scraper company's FAQ was, in my view, full-on unethical:

> "we check the robots.txt file. If the site permits robots in general to scrape their site (NOT just GoogleBot!), then we will do so. We will make no effort to look for other terms and conditions as well."

You will ostentatiously "make no effort to look" for T&Cs in case they prohibit the significant contract you're about to sign with the Government? Whoa.

So how I feel about web scraping is simple: "don't be evil". If you're diverting income or traffic from the original site, don't do it. If you're genuinely adding value, go for it, but be open, be prepared to work with the original site, and be prepared to accede to their wishes.

3 comments

Put the Terms and Conditions (the part relevant to scraping) in the /robots.txt as well.
Yes. Did that after this episode.
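For anyone wanting to try the same thing, here is a rough sketch of what that might look like, checked with Python's standard `urllib.robotparser`. The robots.txt wording, the `/articles/1` path, and the `CustomScraper/1.0` user agent are all made up for illustration and are not the actual site's file. The comment lines carry the terms for human readers, while the `Disallow` rule is the part a scraper that "checks the robots.txt file" would actually act on.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the '#' lines restate the scraping terms for humans,
# and the rules below them say the same thing in machine-readable form.
ROBOTS_TXT = """\
# Terms and Conditions (illustrative wording): no scraping without written permission.
# Ask the site operator first if you need the data.

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt content from a string instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot is explicitly allowed; a generic custom scraper falls under '*'.
googlebot_ok = parser.can_fetch("Googlebot", "/articles/1")
custom_ok = parser.can_fetch("CustomScraper/1.0", "/articles/1")

print("Googlebot allowed:", googlebot_ok)
print("Custom scraper allowed:", custom_ok)
```

None of this stops a scraper that ignores robots.txt, of course; it only removes the "robots in general were permitted" excuse.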
Were you seriously expecting bots to read your T&C? Or anyone, for that matter? Did you mention that it was okay for Google to scrape your site?
We're not talking generic "bots".

We're talking a custom scraper written for this site and this site only.

Yes, I am expecting the people who spend hours inspecting the source of my site, and then writing a custom scraper for it, to spend 30 seconds reading the T&Cs first.

Not sure why you'd expect that. If my web browser can download your source code, my software will as well.

If you want people to read it put your content behind a sign up with a checkbox.

It is _already_ behind a sign-up with a checkbox. They scraped their way past that too.
You could rate limit the site and, when a limit is hit, replace paragraphs with lorem ipsum.
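A minimal sketch of that idea, assuming a per-client sliding window kept in memory. The class name `SlidingWindowLimiter`, the `serve` helper, and the limits are hypothetical, not taken from any real site; a production setup would more likely do this in a reverse proxy or a shared store.

```python
import time
from collections import defaultdict, deque

LOREM = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."


class SlidingWindowLimiter:
    """Counts requests per client over the last `window_seconds` seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self.hits = defaultdict(deque)

    def over_limit(self, client_id):
        """Record one request and report whether the client exceeded the limit."""
        now = self.clock()
        hits = self.hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        return len(hits) > self.max_requests


def serve(limiter, client_id, paragraphs):
    """Return the real paragraphs, or same-length filler once the limit is hit."""
    if limiter.over_limit(client_id):
        return [LOREM for _ in paragraphs]
    return list(paragraphs)


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
    page = ["First real paragraph.", "Second real paragraph."]
    for _ in range(3):
        print(serve(limiter, "203.0.113.7", page))
```

The response keeps the same number of paragraphs, so the page layout looks intact to a scraper that is not checking the text. Identifying clients by IP address alone is an assumption here and is easy to get around with rotating proxies.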
Did your service offer a paid API? Scraping happens because of a lack of better options. Surely you can understand why the scrapers didn't want to contact you beforehand.
If you want my data on a paid API basis, then ask me about it. I need to know how big the demand is from third-party users before I even prioritize building a paid API; having the god damn courtesy to ask for something would give me an idea.

If you're using my data to hijack my traffic, without asking, you could have all the right justifications in the world but you're still a prick. Who knows, maybe your orphanage-building app will move me to tears once I hear about it and I'll give you free access.

In what world is "engage a third-party scraping company" a better option than "drop a quick email to the site operator"?
Because with scraping you are in a legal grey area. But if you contact the site directly and they say "no", then there is no excuse to scrape.
Well, yeah, like if you ask someone to sell you their dog and they say "no". Doesn't justify stealing the dog.
Your analogy doesn't hold up. Your example is clearly theft, and is a criminal matter. Violation of a site's terms is a civil one, and again is a legal grey area. Scraping the site doesn't delete the content from a server... but there is only one copy of the dog.

EDIT - I can't reply to your comment below, but FWIW I agree that scraping sites in this manner is unethical. I am merely describing the logic that most scrapers go through for self justification and legal protection.

The analogy was "doing something anyway because you might be told 'no'", not "my server behaves like a canine quadruped". Copyright infringement is also potentially criminal in the UK; it's not as simple as you suggest.

But whatever. It just saddens me that the internet is a constant "don't be a dick" battle with companies like the scraper guys.

(edit - understood :) )

Yes, because they did not want to pay. This is not some complex issue.

Incidentally, I'll be shopping later. May I give you $5 for you to drive to London and take me there? I was going to steal your car, but then I figured I'd be a good citizen and demand you provide it to me at your own, prohibitive, loss.