by tangue 3588 days ago
Reading the previous thread again, I suppose that many of those against scraping didn't realize they've already lost: with Ghost, Phantom, and now headless Chrome, you're going to have a hard time detecting a well-built scraper.

Instead of fighting against scrapers that don't want to harm you, maybe it's about time to invest in your robots.txt and cooperate.
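
For what "cooperating" through robots.txt might look like on the scraper side, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `example-scraper` agent name are all made up for illustration; a real scraper would fetch the file from the site it is about to crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "example-scraper"  # hypothetical user-agent name


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# Ask before fetching: is this path allowed for our agent?
allowed_listing = parser.can_fetch(AGENT, "https://example.com/listings/1")
allowed_private = parser.can_fetch(AGENT, "https://example.com/private/x")

# Crawl-delay is a non-standard but widely used directive.
delay_seconds = parser.crawl_delay(AGENT)

print(allowed_listing, allowed_private, delay_seconds)
```

A scraper that skips disallowed paths and waits `delay_seconds` between requests is roughly the cooperative behavior being asked for here.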

You could say that scraping your website is FORBIDDEN, but come on: if Airbnb can rent houses, I can scrape your site.

2 comments

>Reading the previous thread again, I suppose that many of those against scraping didn't realize they've already lost: with Ghost, Phantom, and now headless Chrome, you're going to have a hard time detecting a well-built scraper.

Unfortunately, if you're scraping some data that only has one authoritative data source, they'll know you're scraping them even if they can't distinguish your individual requests from the general traffic.

This is what happened to my company. It didn't stop them from pretending that we were setting their servers on fire, even though they had no way to know whether we were or not since they couldn't distinguish our traffic from that generated by other browsers.

We were scraping only factual data in which the company cannot hold a copyright interest. Nonetheless, under Ticketmaster v. RMG, just holding a copy of a page in RAM long enough to parse it constitutes infringement (you have to prove fair use, as Google supposedly did in Perfect 10 v. Google, to avoid this).

The difference between you and Google/Airbnb is that the latter have a lot of money and are trendy technology companies, and you don't and aren't (yet).

The lesson is: become really big before someone sues you, and the judiciary will be on your side.

"Unfortunately, if you're scraping some data that only has one authoritative data source, they'll know you're scraping them even if they can't distinguish your individual requests from the general traffic."

How would they know you're scraping them?

Surely the capability of any given website admin to detect a particular scraper would depend on many factors such as whether they're even looking for scrapers or are technologically capable of doing so, how many/which IPs the scraping is originating from, and how cleverly the scraper goes about their scraping, no?

It's a bit of a cat and mouse game, wouldn't you say?

They know you're scraping them because their site is the only source of the data you're scraping. The most common example here is airlines. Airlines that haven't agreed to be included in fare aggregators often have their booking information scraped. Even if your traffic blends in, they know that you're reading out fare data from them, because where else would you get it from? This is especially true if you follow it up with a link to buy the specific fare at the airline's site. The only plausible way to have that is to read it off of their site (and, even if you can use a template based on their URL structure, I think there would probably be a case to be made that URLs qualify for copyright and trademark protection).

As for the game of cat and mouse, it lasts until they call in their lawyers. Then it's a game of "quit now or get destroyed".

But yes, if you can scrape the data without ever tipping off the company you're scraping, you can probably continue indefinitely, but you have to consider whether you can plausibly argue that you're getting that data from someplace else. If they sue you on the suspicion that you're scraping them, they'll probably subpoena the code to confirm that (or similar -- IANAL), and then proceed to try to make a case on things other than CFAA violations.

Oh, you're talking about them inferring that you must have scraped them because you used or published data that only they had.

Not every scraper has publishing or using data in a detectable way as their motive.

For instance, I sometimes scrape a website to make an archive of it for my own personal use. I never publish the results or use them in any way that the website owners would ever know about. So the only way they could know that they were scraped is if I left some kind of scraping signature while scraping (such as scraping from a single IP and doing it quickly enough to pop up on their radar, or perhaps too regularly -- i.e., without random waits between requests, etc.).
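
To make the "random waits between requests" part concrete, here is a rough sketch of a fetch loop with jittered delays. Everything in it is an assumption for illustration: the `fetch_politely` and `jittered_delay` names, the 2-second base and 3-second jitter, and the injected `fetch` and `sleep` callables (so the loop can be tried without hitting a real site).

```python
import random
import time
from typing import Callable, Iterable, List, Optional


def jittered_delay(base: float, jitter: float, rng: random.Random) -> float:
    """Return base seconds plus a random extra in [0, jitter]."""
    return base + rng.uniform(0.0, jitter)


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    base: float = 2.0,      # assumed minimum wait between requests
    jitter: float = 3.0,    # assumed maximum random extra wait
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Fetch each URL in order, pausing a random interval between requests."""
    rng = rng or random.Random()
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # No wait before the first request, a jittered one before the rest.
            sleep(jittered_delay(base, jitter, rng))
        results.append(fetch(url))
    return results


# Dry run with a fake fetch and a fake sleep that only records the delays.
recorded_delays: List[float] = []
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fetch=lambda url: "body of " + url,
    sleep=recorded_delays.append,
)
print(pages, recorded_delays)
```

Randomized spacing only removes the timing regularity; a single source IP or an unusually high request volume would still stand out.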

What you're talking about is probably mostly a concern to people/companies who are somehow making money from scraping data on other people's websites.

It depends on your definition of harm. When your product is what's published on the website and you regularly find ripoffs of said website publishing your ripped-off content, maybe you'd feel differently about it.

Not sure what that has to do with scraping. A desktop browser can be used to copy and paste chunks of content and plagiarize a site. We have reasonable copyright protections to protect authors against that. What we need to discard are the unreasonable laws regarding network access. It's not all or nothing.

Yeah, but that's not just because of web scraping. Plagiarism has been an issue for centuries.

Fair enough, but I don't think that's the main purpose. There are many, many cases where you would want to scrape something, and people would probably be encouraged to do so in a "polite" way if websites didn't make it hard.

Yes, or if they just provided a CSV with all the data most people wanted to scrape anyway, with a plain English explanation of how it can be used.

That argument only holds up if you believe in intellectual property. Many of us here do not.