
by throwaway6845 3299 days ago

I built a simple CRUD app for a previous (small) employer. Nothing special technology-wise, but a good concept, sound business model, and backed up with a couple of full-time staff creating content for it. Line one of the T&Cs was "no scraping". Business model was based on sales to individual users but we were prepared to do analysis in aggregate if asked.

A scraper company, funded by magic money (Knight Foundation grants) and $1m of VC, convinced a (UK) Government department to pay them to scrape our site for some analysis the department wanted. They'd never contacted us, never asked for permission, never asked if we could supply the data. Our company was bumping along at this point and having to lay people off. Income from a nice lucrative Government contract would have kept a couple more people in work.

The scraper company's FAQ was, in my view, full-on unethical:

> "we check the robots.txt file. If the site permits robots in general to scrape their site (NOT just GoogleBot!), then we will do so. We will make no effort to look for other terms and conditions as well."

You will ostentatiously "make no effort to look" for T&Cs in case they prohibit the significant contract you're about to sign with the Government? Whoa.
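For readers unfamiliar with the check the quoted FAQ describes, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `example.com` URL are made up for illustration; nothing here is the scraper company's actual code. It shows the distinction the FAQ draws between rules for robots in general (`*`) and rules for Googlebot only, and it also shows the limit of the approach: robots.txt says nothing about the T&Cs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot is welcome, all other robots are not.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def robot_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in-memory text, no network
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print("generic robot:", robot_allowed(ROBOTS_TXT, page))
    print("Googlebot:", robot_allowed(ROBOTS_TXT, page, "Googlebot"))
```

A `True` result from a check like this only means the robots.txt does not object. It is not evidence that the site's terms permit scraping.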

So how I feel about web scraping is simple: "don't be evil". If you're diverting income or traffic from the original site, don't do it. If you're genuinely adding value, go for it, but be open, be prepared to work with the original site, and be prepared to accede to their wishes.

3 comments

clamprecht 3299 days ago

Put the Terms and Conditions (the part relevant to scraping) in the /robots.txt as well.
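One way this suggestion could look in practice, as a rough sketch: robots.txt allows `#` comment lines, which crawlers skip but a person writing a custom scraper will see. The wording of the notice and the helper name `human_notices` are assumptions for illustration, not the poster's actual file.

```python
# Hypothetical robots.txt carrying the scraping clause as comments,
# alongside a machine-readable rule for generic robots.
ROBOTS_TXT_WITH_TERMS = """\
# Terms and Conditions, line one: no scraping.
# Contact the site operator to discuss data access.
User-agent: *
Disallow: /
"""


def human_notices(robots_txt: str) -> list[str]:
    """Collect comment lines, which parsers ignore but people can read."""
    notices = []
    for line in robots_txt.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            notices.append(stripped.lstrip("#").strip())
    return notices


if __name__ == "__main__":
    for notice in human_notices(ROBOTS_TXT_WITH_TERMS):
        print(notice)
```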

throwaway6845 3298 days ago

Yes. Did that after this episode.

tokenizerrr 3298 days ago

Were you seriously expecting bots to read your T&C? Or anyone, for that matter? Did you mention that it was okay for Google to scrape your site?

throwaway6845 3298 days ago

We're not talking generic "bots".

We're talking a custom scraper written for this site and this site only.

Yes, I am expecting the people who spend hours inspecting the source of my site, and then writing a custom scraper for it, to spend 30 seconds reading the T&Cs first.

tokenizerrr 3298 days ago

Not sure why you'd expect that. If my web browser can download your source code, my software will as well.

If you want people to read it, put your content behind a sign-up with a checkbox.

throwaway6845 3298 days ago

It is _already_ behind a sign-up with a checkbox. They scraped their way past that too.

jlebrech 3298 days ago

You could rate-limit the site and, when a limit is hit, replace paragraphs with lorem ipsum.
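A minimal sketch of that idea, assuming a per-client sliding-window limit. The class name, the thresholds, and the `client_id` key are all hypothetical; a real site would need to decide how to identify clients (account, IP address, session) and where to keep the counters.

```python
import time
from collections import defaultdict, deque
from typing import Optional

LOREM = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."


class SlidingWindowLimiter:
    """Counts requests per client over the last `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> request timestamps

    def over_limit(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        return len(hits) > self.max_requests


def render_paragraphs(paragraphs, limiter, client_id, now=None):
    """Return the real paragraphs, or filler text once the client is over the limit."""
    if limiter.over_limit(client_id, now):
        return [LOREM for _ in paragraphs]
    return list(paragraphs)
```

Serving filler instead of an error page means a scraper may not notice it has been limited, which is the point of the suggestion. The trade-off is that a heavy legitimate user would see the same filler.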

cheetos 3298 days ago

Did your service offer a paid API? Scraping happens because of a lack of better options. Surely you can understand why the scrapers didn't want to contact you beforehand.

pimmen89 3298 days ago

If you want my data on a paid API basis, then ask me about it. I need to know how big the demand is from third-party users before I even prioritize building a paid API; having the god damn courtesy to ask for something would give me an idea.

If you're using my data to hijack my traffic, without asking, you could have all the right justifications in the world but you're still a prick. Who knows, maybe your orphanage-building app will move me to tears once I hear about it and I'll give you free access.

throwaway6845 3298 days ago

In what world is "engage a third-party scraping company" a better option than "drop a quick email to the site operator"?

binarymax 3298 days ago

Because with scraping you are in a legal grey area. But if you contact the site directly and they say "no", then there is no excuse to scrape.

throwaway6845 3298 days ago

Well, yeah, like if you ask someone to sell you their dog and they say "no". Doesn't justify stealing the dog.

binarymax 3298 days ago

Your analogy doesn't hold up. Your example is clearly theft, and is a criminal matter. Violation of a site's terms is a civil one, and again is a legal grey area. Scraping the site doesn't delete the content from a server...but there is only one copy of the dog.

EDIT - I can't reply to your comment below, but FWIW I agree that scraping sites in this manner is unethical. I am merely describing the logic that most scrapers go through for self-justification and legal protection.

throwaway6845 3298 days ago

The analogy was "doing something anyway because you might be told 'no'", not "my server behaves like a canine quadruped". Copyright infringement is also potentially criminal in the UK; it's not as simple as you suggest.

But whatever. It just saddens me that the internet is a constant "don't be a dick" battle with companies like the scraper guys.

(edit - understood :) )

jbreckmckye 3298 days ago

Yes, because they did not want to pay. This is not some complex issue.

Incidentally, I'll be shopping later. May I give you $5 for you to drive to London and take me there? I was going to steal your car, but then I figured I'd be a good citizen and demand you provide it to me at your own, prohibitive, loss.