by pmoriarty 3587 days ago
"Unfortunately, if you're scraping some data that only has one authoritative data source, they'll know you're scraping them even if they can't distinguish your individual requests from the general traffic."

How would they know you're scraping them?

Surely the capability of any given website admin to detect a particular scraper would depend on many factors such as whether they're even looking for scrapers or are technologically capable of doing so, how many/which IPs the scraping is originating from, and how cleverly the scraper goes about their scraping, no?

It's a bit of a cat and mouse game, wouldn't you say?


They know you're scraping them because their site is the only source of the data you're scraping. The most common example here is airlines. Airlines that haven't agreed to be included in fare aggregators often have their booking information scraped. Even if your traffic blends in, they know that you're reading out fare data from them, because where else would you get it from? This is especially true if you follow it up with a link to buy the specific fare at the airline's site. The only plausible way to have that is to read it off of their site (and, even if you can use a template based on their URL structure, I think there would probably be a case to be made that URLs qualify for copyright and trademark protection).

As for the game of cat and mouse, it lasts until they call in their lawyers. Then it's a game of "quit now or get destroyed".

But yes, if you can scrape the data without ever tipping off the company you're scraping, you can probably continue indefinitely. You do have to consider whether you can plausibly argue that you're getting that data from someplace else, though. If they sue you on the suspicion that you're scraping them, they'll probably subpoena the code to confirm it (or similar -- IANAL), and then proceed to try to make a case on things other than CFAA violations.

Oh, you're talking about them inferring that you must have scraped them because you used or published data that only they had.

Not every scraper intends to publish or use the data in a way that can be detected.

For instance, I sometimes scrape a website to make an archive of it for my own personal use. I never publish the results or use them in any way that the website owners would ever know about. So the only way they could know that they were scraped is if I left some kind of scraping signature while scraping (such as scraping from a single IP and doing it quickly enough to pop up on their radar, or perhaps regularly enough, i.e. without random waits between requests).
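To make the "scraping signature" idea concrete, here is a rough sketch of the kind of check an admin could run over the request times seen from a single IP. It is purely illustrative: the function name `looks_automated` and both thresholds (`max_rate`, `min_jitter`) are made up for this example, and I have no idea what any real site actually uses.

```python
from statistics import mean, pstdev


def looks_automated(timestamps, max_rate=1.0, min_jitter=0.1):
    """Guess whether one client's request times look scripted.

    timestamps: sorted request times in seconds, all from one IP.
    max_rate:   hypothetical cutoff, average requests per second.
    min_jitter: hypothetical cutoff for the coefficient of variation
                of the gaps between requests.
    """
    if len(timestamps) < 3:
        return False  # too few requests to say anything

    # Time elapsed between each pair of consecutive requests.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    avg_gap = mean(gaps)
    if avg_gap <= 0:
        return True  # everything arrived at the same instant

    # Signature 1: requests arrive faster than the allowed average rate.
    too_fast = 1.0 / avg_gap > max_rate

    # Signature 2: gaps are nearly identical (no random waits), so the
    # spread of the gaps is tiny relative to their average.
    too_regular = pstdev(gaps) / avg_gap < min_jitter

    return too_fast or too_regular
```

With these made-up thresholds, a client hitting the site exactly every 5 seconds gets flagged for regularity even though it is slow, while a client with widely varying gaps does not. Spreading requests across several IPs would defeat a per-IP check like this one, which is the "how many/which IPs" factor mentioned earlier in the thread.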

What you're talking about is probably mostly a concern to people/companies who are somehow making money from scraping data on other people's websites.