
by cookiecaper 3587 days ago

>Reading the previous thread again, I suppose that many of those against scraping didn't realize they've already lost: with Ghost, Phantom, and now headless Chrome you're going to have a hard time detecting a well-built scraper.

Unfortunately, if you're scraping some data that only has one authoritative data source, they'll know you're scraping them even if they can't distinguish your individual requests from the general traffic.

This is what happened to my company. They couldn't distinguish our traffic from that generated by other browsers, so they had no way to know whether we were setting their servers on fire, but that didn't stop them from pretending we were.

We were scraping only factual data, in which the company cannot hold a copyright interest. Nonetheless, under Ticketmaster v. RMG, just holding a copy of a page in RAM long enough to parse it constitutes infringement (you have to prove fair use, as Google supposedly did in Perfect 10 v. Google, to avoid this).

The difference between yourself and Google/Airbnb is that the latter have a lot of money and are trendy technology companies, and you don't and aren't (yet).

The lesson is: become really big before someone sues you, and the judiciary will be on your side.


pmoriarty 3587 days ago

"Unfortunately, if you're scraping some data that only has one authoritative data source, they'll know you're scraping them even if they can't distinguish your individual requests from the general traffic."

How would they know you're scraping them?

Surely the capability of any given website admin to detect a particular scraper would depend on many factors such as whether they're even looking for scrapers or are technologically capable of doing so, how many/which IPs the scraping is originating from, and how cleverly the scraper goes about their scraping, no?

It's a bit of a cat and mouse game, wouldn't you say?


cookiecaper 3587 days ago

They know you're scraping them because their site is the only source of the data you're scraping. The most common example here is airlines. Airlines that haven't agreed to be included in fare aggregators often have their booking information scraped. Even if your traffic blends in, they know that you're reading out fare data from them, because where else would you get it from? This is especially true if you follow it up with a link to buy the specific fare at the airline's site. The only plausible way to have that is to read it off of their site (and, even if you can use a template based on their URL structure, I think there would probably be a case to be made that URLs qualify for copyright and trademark protection).

As for the game of cat and mouse, it lasts until they call in their lawyers. Then it's a game of "quit now or get destroyed".

But yes, if you can scrape the data without ever tipping off the company you're scraping, you can probably continue indefinitely. You do have to consider whether you can plausibly argue that you're getting that data from someplace else. If they sue you on the suspicion that you're scraping them, they'll probably subpoena the code to confirm that (or similar -- IANAL), and then proceed to try to make a case on things other than CFAA violations.


pmoriarty 3587 days ago

Oh, you're talking about them inferring that you must have scraped them because you used or published data that only they had.

Not every scraper has publishing or using data in a detectable way as their motive.

For instance, I sometimes scrape a website to make an archive of it for my own personal use. I never publish the results or use them in any way that the website owners would ever know about. So the only way they could know that they were scraped is if I left some kind of scraping signature while scraping (such as scraping from a single IP and doing it quickly enough to pop up on their radar, or perhaps regularly enough -- i.e. without random waits between requests, etc).
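To make the "random waits between requests" bit concrete, here's a rough sketch of what I mean, not a description of any particular tool. The names (archive_pages, jittered_delay) and the default timings are made up for illustration, and the fetch function is passed in so you can plug in whatever HTTP client you like.

```python
import random
import time
from typing import Callable, Dict, Iterable, Optional


def jittered_delay(base: float, jitter: float, rng: random.Random) -> float:
    """Return a wait in seconds drawn uniformly from [base, base + jitter]."""
    return base + rng.uniform(0.0, jitter)


def archive_pages(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    base: float = 5.0,      # assumed minimum pause; pick what suits the site
    jitter: float = 10.0,   # assumed extra random pause on top of the minimum
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> Dict[str, bytes]:
    """Fetch each URL in order, pausing a random interval between requests."""
    rng = rng or random.Random()
    pages: Dict[str, bytes] = {}
    for i, url in enumerate(urls):
        if i > 0:
            # No pause before the first request, a randomized one before the rest.
            sleep(jittered_delay(base, jitter, rng))
        pages[url] = fetch(url)
    return pages
```

For a real run you could pass something like `fetch=lambda u: urllib.request.urlopen(u).read()` (after `import urllib.request`). The long, irregular pauses also keep the load on the site low, which is the polite thing to do anyway.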

What you're talking about is probably mostly a concern to people/companies who are somehow making money from scraping data on other people's websites.
