


	by throwwebmaster 1777 days ago
	Don’t most websites prohibit someone from scraping and harvesting the data on them? Most recently, I can think of Yelp, Amazon, and GitHub prohibiting this, as well as the Aaron Swartz case.

4 comments

apazzolini 1777 days ago

Up until June 14th of this year, the ruling in the HiQ vs LinkedIn lawsuit was that scraping is legal[0].

While finding that link for you, I learned that since that date, the Supreme Court vacated the decision and sent it back to the lower court in light of a new decision of theirs.[1]

Now I don't think it's clear one way or the other just yet. Any lawyers here with an opinion on how this is going to go? I haven't found any analysis.

[0] https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

[1] https://www.reuters.com/technology/us-supreme-court-revives-...

jdminhbg 1777 days ago

Scraping is legal, but that doesn't mean the site has to make it easy or possible for you. You can scrape my site, but I can also IP ban scrapers.

bilbo0s 1777 days ago

Well given all this and the consent decree, I'm kind of seeing why FB made this decision. If it's legal, let the FTC say so explicitly.

It's probably very bad strategy to allow the privacy leak, and then hope that the FTC agrees with your decision later. No one will be sanctioned for adhering too strictly to consent decrees, but you could be sanctioned for being too loose. So the choice is obvious in that light.

throwwebmaster 1774 days ago

It will be interesting to see how the FTC will reconcile its ongoing fight to hold FB more accountable for protecting its users’ private data with the FTC’s ostensibly contradictory position to allow mass scraping of private user data at scale in this case. Cambridge Analytica was doing roughly the same thing, which was precisely what motivated the FTC's involvement in the first place.

kcplate 1777 days ago

“Prohibit” is an interesting word that, in the world of public web publishing of data, is actually pretty meaningless. Certainly companies take measures to prevent bulk scraping when they can detect it, and they have legal remedies if copyrighted or owned material is republished. But simply telling me I am prohibited from holding on to a copy of your site’s data is pretty meaningless when my browser caches your page and you want my browser to cache it to speed your site up.

throwwebmaster 1777 days ago

By accessing the data on a website, you are forming a contract with that website and are subject to the website’s terms and conditions.

As a website owner, beyond that, there is no need to tell someone they cannot access your site if you simply block them from accessing your servers instead using a multitude of techniques.

Where does caching come into play at all here? You cannot cache content to begin with if the server is blocking access in the first place. And if you have already cached it in the act of violating said website’s terms of service, then you are still not in compliance.

wkavey 1777 days ago

I'm pretty sure that first paragraph is false. Until I explicitly agree to a contract, or "terms and conditions", I am not bound by anything. If a site I navigate to embeds content from another website, I am not immediately bound by the terms and conditions of that embedded site; to think otherwise would invite madness.

Not to mention the fact that terms and conditions are not contracts. I don't think they carry the same weight, although someone please correct me on this if I am incorrect.

kcplate 1776 days ago

Plaintiff: “Judge, when the defendant used my public website, a contract with me was implicitly made”

Judge: “what defendant? There is no one here.”

Plaintiff: “Oh he was anonymous, so I am not sure who it was…”

Judge: “Hmm interesting, so you seem to think an implicit contract exists that you want to enforce with no documentation at all with a party you can’t name, because you are not sure who it is?”

Plaintiff: “Exactly.”

Judge: “Feel free to come back when you aren’t going to waste the court’s time”

throwwebmaster 1774 days ago

This clear instance of reductio ad absurdum is wholly non-analogous. Factually inaccurate court proceeding depictions and legal misunderstandings aside, for one thing, the non-hypothetical defendant’s identity is well-known in this specific instance.

kcplate 1774 days ago

Interesting. I would have thought that someone with the ability to craft such a sesquipedalian response would also have been capable of understanding irony.

Also…you missed the word “public”, again.

throwwebmaster 1774 days ago

Less speculation and more facts:

https://www.upcounsel.com/are-website-terms-and-conditions-l...

This is heavily supported by case law.

nieve 1777 days ago

Not according to some rulings the last few years. A contract requires a meeting of minds.

throwwebmaster 1774 days ago

No.

According to many rulings the last few years, continually and systematically accessing a third party’s data under the clear expectation that you are aware of (as well as agreed to) their terms is definitely a meeting of minds.

kcplate 1776 days ago

I think you missed the word “public” in my comment. If you are posting content for public consumption, then unless that content is copyrighted by you AND I republish/sell/etc. it, you basically don’t have much say if I choose to keep a copy of it.

If you don’t want me to have a copy of it for any reason, don’t let me have it at all.

throwwebmaster 1774 days ago

Is it public though? For one thing, you need to have a registered user account and be part of the targeted audience as a winning biddee for ad placement in order to see the ads in question. As an ordinary person, unless you share your private login credentials with me (which would be another ToS violation), I can neither view nor access the ads you have been shown.

kcplate 1774 days ago

You seem to be hung up on users and passwords; I am talking about anonymous, publicly accessible sites.

If you publish content on a website that willingly provides data to anonymous users of your site, even with a TOC on the site, the TOC is not enforceable if you cannot prove that the user explicitly agreed to the TOC. If you don’t know who the user is, you can’t prove that they agreed to your TOC.

Having a TOC is basically legal theater if you allow anonymous users. The implied threat is basically “IF we find out who you are” and you use the site in a way that is contrary to our published TOC, we will take action against you.

Your only recourse in that case is to pursue sites that are republishing your copyrighted content…because only at that point can you actually identify the party that may be misusing your site and its content.

throwwebmaster 1774 days ago

In reply to your sibling comment: I actually agree with you in the anonymous case and point out that many of the examples named (including FB, the topic of discussion) require user authentication.

ClumsyPilot 1777 days ago

"most websites prohibit someone from scraping"

This weird mindset where corporations make law shows up again.

The websites have no power to prohibit anything. If they make bytes available, we may do with them as we please so long as it's legal.

Sebb767 1777 days ago

They can't throw you in jail over it, but it's within their rights to stop sending you these bytes or kick you off their platform altogether.

If they tried to prevent you from scraping third-party sites, that would be making laws; setting up ground rules with their ToS and enforcing them is absolutely fine.

dathinab 1776 days ago

> but it's within their rights to stop sending you these bytes or kick you off their platform altogether.

Actually no.

If a platform provides a generally available service, they are (in many countries, idk. about the US) not allowed to arbitrarily exclude some people they don't like without a legally valid reason.

And breaking terms in a ToS that are not legally valid/binding is not a legally valid reason. Just because you write something in your ToS doesn't mean it has any legally relevant meaning; there are limits to what you can put in a ToS. And limiting (properly done, privacy respecting) research is often not valid. (Though it depends a lot on the country.)

throwwebmaster 1774 days ago

The FTC is a US government entity.

ClumsyPilot 1776 days ago

Imagine I am scraping Twitter - maybe I never accepted their TOS and don't even have an account.

throwwebmaster 1774 days ago

Interesting that you mentioned Twitter. Twitter requires a user account to access content. Try accessing Twitter while logged out.

throwwebmaster 1777 days ago

It’s within anyone’s rights as a website maintainer to block IP addresses that scrape, or any other addresses, at their discretion.

Nobody is legally forcing websites to allow access to everyone, and accordingly, nobody is altering the law by blocking access to people (crawlers, hackers, spammers, malcontents, or anybody really) that they feel are not welcome. So exercising one’s existing rights isn’t an act of making or altering laws.

I suggest reading up on what robots.txt is to further understand this.
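
For anyone unfamiliar with it, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The rules, the `ExampleBot` user agent, the `example.com` URLs, and the `is_allowed` helper are all made up for illustration. Note that this is a voluntary check performed by the crawler; the file does not block anything by itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/public/page", "/private/data"):
        url = "https://example.com" + path
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

A crawler that ignores the result can still request the page, which is why sites that care also block at the server.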

ClumsyPilot 1776 days ago

Hacking and other malicious behaviours are actually illegal.

Either crawling does not belong on that list, or Google execs should be in jail.

Given that crawling is not malicious, what we are discussing now is 'someone is crawling my website in a way I don't like', which is a different gripe.

It might have some merit, but robots.txt is not legally binding.

throwwebmaster 1776 days ago

As I stated before, illegal or not, it is within the website owners’ rights to restrict access.

literallyaduck 1776 days ago

Are these "researchers" going to be charged the same as the martyr Swartz?

throwwebmaster 1774 days ago

The parties here would be the US FTC and FB, so no.