| HN Mirror



	by runbycomment 4066 days ago
	In your opinion, how much of that situation's complexity is eliminated simply by scraping the Google cache of a site? I also wonder how possible it is to hide behind proxies, especially if they are owned by entities in other countries. If a site I'm scraping is unable to identify who does the scraping, it seems difficult for them to prove "this guy uses our data and must be scraping us".

1 comments

fencepost 4066 days ago

The more you have to jump through hoops to get the data (or hide that you're getting it or that you're the one getting it), the more it sounds like you're doing this for the wrong reasons.

Also, since this is presumably something you're going to be doing as a hobby (money creates trails), the unfortunate reality is that "right" and "wrong" in copyright law matter much less than "Oh crap, I'm being sued for $500k in $further_away(New York|California), how do I defend this?" That's why you don't ignore the polite way of saying "go away", which is robots.txt, or the rude way, which is a C&D. If a lawsuit (the mean way) is the first communication you have from a company, odds are pretty good that an attorney can help, because judges are busy and don't want lawsuits to be the first thing unhappy companies try.
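For what it's worth, honoring robots.txt is cheap to do in code. Here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `is_allowed` helper, the `my-hobby-scraper` user agent, and the example.com URL are all made up for illustration, not taken from any real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is welcome, everyone else is not.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Illustrative helper (not part of any library): it parses the rules
    from a string so the check works without a network request.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/listings"  # placeholder URL
    for agent in ("Googlebot", "my-hobby-scraper"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

In a real scraper you would more likely point the parser at the live file with `set_url(...)` and `read()` and run the check before every request. Passing the check only means you respected the polite "go away"; it says nothing about the legal side.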


runbycomment 4066 days ago

I understand what you're saying, we just come from very different perspectives. Most of my results are after significant transformation and combination, resulting in models to test against. I'm not very concerned with copyright violation, as I rarely (never?) re-publish copyrighted information.

Have there been any court cases where a person scraping public information has been found in the wrong? I know of the LinkedIn case from Jan 2014, but in that case the offenders were creating LI accounts to scrape private information. I believe that Craigslist lost its case against e.g. padmapper, didn't they?

While I respect what you're saying in your first sentence, I view it differently. Setting aside the legal issues, I see it as someone trying to control use in a public space. I don't consider that a valid reason -- if it's public, I can consume it. Avoiding detection is a reaction to sites trying to create rules that I interpret as invalid.

If a company tried to block off a public road without legal backing, I would consider it not only my right but also my duty to traverse that road. [mediocre analogy, but it does represent my opinion fairly accurately.]


fencepost 4066 days ago

The things that jump out at me there are "that I interpret as invalid" and "Have there been any court cases where a person scraping public information has been found in the wrong?"

Tackling the second one first, I'd like to rephrase that: "Have all the court cases where a person was scraping public information been found in their favor and they were awarded all attorney fees and expenses?"

As far as "that I interpret as invalid" goes, the courts exist to decide between varying interpretations of rights and laws. I've never heard that "inexpensively" was expected to be part of that description. I'm not saying that you're wrong - I'm just saying that there's a significant difference between "I'm taking on a coding and data analysis project" and "I'm taking on a coding and data analysis project with a big helping of legal distractions."

I'm not fully up on the Craigslist vs padmapper/3taps case - was it ever actually fully decided? And how much did fighting that case cost 3taps? Looking at the statement on their website, it doesn't sound all that victorious, and I can't help but suspect that, even ignoring whatever financial impact there was, the distraction and demands of the case must have had a serious effect on any projects 3taps was working on (or considering and back-burnering) during that time.

As a counterexample, since you said you were going to be keeping and displaying thumbnails, I'll toss out the artwork from "Kind of Bloop" (see http://waxy.org/2011/06/kind_of_screwed/), which was a highly-pixelated (and maybe only 8-color?) transformation of a photo of Miles Davis. TL;DR: Andy Baio ended up paying ~$32k to settle the case, not because he thought he was wrong but because it was the least expensive option.

I'm not saying don't do it - I'm just saying that you should go into it with your eyes open and don't do things that will exacerbate any non-technical problems you may run into. That may be a chilling effect, but at least you can bring a coat.


fredophile 4066 days ago

Your public road analogy is very wrong. A better analogy would be a private road with a sign saying "Google streetview welcome. runbycomment stay out." Would you feel entitled to drive down the private road? Would the owner allowing Google to drive down the road make you feel entitled to do it?

We aren't discussing a public space. We're talking about a private server. They pay for hosting and bandwidth. Why do you feel entitled to use it?
