Hacker News
by pixelmonkey 2434 days ago
The summary here is that LinkedIn tried to argue that it could prevent scraping of public LinkedIn profile data under its ToS, but the courts have ruled that if data is public and provided by users, it can be scraped/crawled, that is, it isn’t LinkedIn property. This is generally a positive outcome for people and companies turning web text and HTML into structured data, e.g. tools like Puppeteer and Scrapy can be used more freely on sites like LinkedIn, Twitter, and Reddit. Now, you might still get into trouble if you re-publish that data, but you can, at least, safely use the data "internally", and the act of scraping/crawling (politely) is not, per se, unlawful.
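To make "politely" concrete, here is a minimal sketch of one common reading of it: honor the site's robots.txt rules and its crawl delay before fetching anything. It uses only Python's standard `urllib.robotparser`; the robots.txt content, the `example-bot` user agent, and the `example.com` URLs are made-up placeholders, and nothing here is a statement about what the ruling requires.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_plan(parser, user_agent, urls, default_delay=1.0):
    """Return (allowed_urls, delay_seconds) for a crawl that respects robots.txt.

    default_delay is an assumed fallback for sites that declare no Crawl-delay.
    """
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        delay = default_delay
    allowed = [url for url in urls if parser.can_fetch(user_agent, url)]
    return allowed, float(delay)


if __name__ == "__main__":
    parser = build_parser(EXAMPLE_ROBOTS_TXT)
    candidates = [
        "https://example.com/in/someone",
        "https://example.com/private/settings",
    ]
    allowed, delay = polite_plan(parser, "example-bot", candidates)
    print(allowed, delay)
```

A real crawler would sleep for `delay` seconds between requests to the allowed URLs; that loop is left out to keep the sketch short.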
7 comments

Not sure "isn't LinkedIn property" is accurate here. They still retain ownership and control of redistribution just like any other IP. This is more of a philosophical question about whether "viewing" itself is a violation of their ownership rights and really about the definitions of "viewing" and "public" in the context of the internet.

Seems like they've simply determined that viewing any freely accessible URL is "public" and that "viewing" does include scraping. This seems like a very reasonable determination as it maps pretty neatly to how we think about viewing public content IRL where I am free to drive down the road (for profit or pleasure) and record publicly viewable signage and activities and use that data any way I see fit.

> Not sure "isn't LinkedIn property" is accurate here.

It is very accurate. Users retain the copyright on their works in so far as their works are able to be copyrighted. Anything that is a "mere fact", and can't be copyrighted, is also not LinkedIn's property.

From LinkedIn's terms of service[1]:

> you are only granting LinkedIn and our affiliates the following non-exclusive license:

> A worldwide, transferable and sublicensable right to use, copy, modify, distribute, publish, and process, information and content that you provide through our Services and the services of others, without any further consent, notice and/or compensation to you or others.

___

1. https://www.linkedin.com/legal/user-agreement#rights

If I enter my employment record and my profile pic, birthdate, etc., I don't think that is the IP of LinkedIn. Maybe the way they display it, or any transformation they apply to it, could be considered IP. But if someone scrapes all that user-entered data and then displays it somewhere else in a different format, I can't imagine LinkedIn being able to claim their IP has been infringed.
I think all of this should be the user's choice, since every company should put the user at the center of these decisions. If I want my data to be shared in any way, I can simply tick a box and allow that. If I don't, then keep it just for me and the people I chose to share it with on that platform.

It should also be made clear to the users if that data is being used as payment for the services provided by mentioning explicitly and in a detailed way where that data goes.

I think (hope?) that's what this decision did. LinkedIn must allow scraping publicly available data, but not private data that a third party wouldn't have access to normally.
Maybe it's more accurate to say "any publicly linked URL"? IIRC, charges have been successfully brought against people for e.g. iterating through user identifiers in URLs to gain access to other users' data. (Do correct me if I'm wrong on that count!)
Andrew Auernheimer, more commonly known as weev, got the email addresses of all of AT&T's iPad users at that time by enumerating all the possible SIM-card IDs against a public-facing AT&T website. He was charged and convicted under the Computer Fraud and Abuse Act (CFAA), and sentenced to 41 months in federal prison for that. His sentence was vacated after 13 months due to a technicality of the venue; that judge did not address the substantive question on the legality of the site access.

Weev may be an odious person, but everyone has rights in a court of law, even white supremacists.

> His sentence was vacated after 13 months due to a technicality of the venue; that judge did not address the substantive question on the legality of the site access

So the way the American legal system works is:

  // Simplified: venue is checked before the facts are weighed.
  if (venue == correct && facts == bad) {
    guilty();
  } else {
    not_guilty();
  }

If the venue is not correct, the facts of the case are not evaluated. If you go read some lawsuits, you'll see that the first page or two is an argument about why the judge reading it is the correct judge to read it.
Generally, that is the way it works, but it is foolish to try to understand the legal system as if it were software. If the venue is incorrect, the judge may more or less tell them to get lost. That's not the same as "not guilty". Many of these rules exist to make sure that courts don't get gummed up with meaningless cases and that judges with the appropriate authority handle the appropriate cases.
You are right; I wanted to give a general idea. And if you've ever written software for Itanium, you'd know that relying on evaluation rules in an if statement is a dangerous thing to do!
> If the venue is not correct, the facts of the case are not evaluated.

More precisely, the facts of the case are not evaluated by that court. Usually the case will be transferred to a different venue (i.e., federal court in a different district) or dismissed and refiled in a different forum (e.g., state court instead of federal court).

In Mr. Auernheimer's case, had he been successful in his improper venue motion, he probably would have faced prosecution in either his home district or the district where the AT&T servers were located. The result of that trial might have been the same, but there wouldn't have been a vacatur.
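The refinement above (improper venue is its own outcome, not "not guilty") could be sketched as a three-way result rather than a two-way branch. This is a toy model only: the `Outcome` names and the `resolve` function are invented for illustration and are not a description of actual procedure.

```python
from enum import Enum


class Outcome(Enum):
    GUILTY = "guilty"
    NOT_GUILTY = "not guilty"
    # Improper venue: this court never evaluates the facts.
    TRANSFERRED_OR_REFILED = "transferred, or dismissed and refiled elsewhere"


def resolve(venue_is_proper: bool, facts_support_conviction: bool) -> Outcome:
    """Toy model: the facts are only weighed once venue is proper."""
    if not venue_is_proper:
        return Outcome.TRANSFERRED_OR_REFILED
    if facts_support_conviction:
        return Outcome.GUILTY
    return Outcome.NOT_GUILTY


if __name__ == "__main__":
    for venue_ok in (True, False):
        for facts_bad in (True, False):
            print(venue_ok, facts_bad, resolve(venue_ok, facts_bad).value)
```

The point of the extra enum member is that the improper-venue branch returns before the facts are looked at, so it cannot be mistaken for an acquittal.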

Some kid was charged for that but in my opinion it was stupid. URL to me means part of the UX. If you search on Google using a query parameter directly instead of entering the query in their search box, should that count as wrongful use?
Stupid or not, that's a matter for the lawmakers. What I'm saying is that, as far as I know, a ruling that any publicly accessible URL is fair game would contradict previous rulings.

Now, this is based on my very patchy memory of sensationalist reporting of legal matters in a jurisdiction I don't reside in, so there's probably some wiggle room there ;)

No, it should not. But what if you try some SQL injection to do something nasty?

The modern legal system distinguishes between result and intent.

If I guess your password in the password form input, should that count as wrongful use?

If I rifle through your personal papers because your door was open, should that count as wrongful use?

I think that's fine, but I also think the end-user should decide. With Google (edit: I meant Facebook) I'm able to determine whether or not I want to show up in search results. This shouldn't be an absolute is or isn't public situation.
LinkedIn already allows discrete control over your profile's public visibility, along with the ability to micro-manage some of it. The URL you're looking for: https://www.linkedin.com/public-profile/settings
You can decide not to use LinkedIn, and use a service that does not make profiles public.
"If your apple looks a little banged up, eat an orange"
Even better, the decision here is only concerning profiles of people who have elected to make that profile public. It's very simple to make your LinkedIn profile private.
The concern I had is that the court forces LinkedIn profiles public regardless of user settings. Courts sometimes go a little further. I'm sure LinkedIn will do their best to not allow private profiles to be crawled.
The challenge for LinkedIn is that they still want Google to crawl them.
This is about the copyright on the items that people post, i.e. creative works, right? But what if LinkedIn collects facts (where you work, your age, etc.), wouldn't that be covered by sui generis property right (better known as database copyright)?

Does this judgement say anything about that, i.e. whether it matters that users contributed the facts in their collection (so I'm not talking about posts, descriptions, etc.) rather than that they collected it themselves and therefore get a form of property right?

Edit: wait, database copyright is not a thing in the USA. Of course they wouldn't say anything about that.

IANAL

> But what if LinkedIn collects facts (where you work, your age, etc.), wouldn't that be covered by sui generis property right (better known as database copyright)?

I don't think so.

> Under the Copyright Act, a compilation is defined as a "collection and assembling of preexisting materials or of data that are selected in such a way that the resulting work as a whole constitutes an original work of authorship." 17 U.S.C. § 101 [1]

The thing is, LinkedIn is not authoring the compilation. The individual users are.

___

1. https://www.bitlaw.com/copyright/database.html

LinkedIn may be the author of the compilation because they curate the database by removing fake profiles and encouraging users to complete their profiles. Also, the graph of connections between profiles may constitute a non-trivial organization method which takes the database out of the trivially-organized databases which were held uncopyrightable in the past. (e.g., Feist v. Rural[0])

In any case, this decision was mostly about upholding the lower court's granting of an order preventing LinkedIn from blocking hiQ's scrapers for the duration of the lawsuit. HiQ could still lose on the copyright questions or other issues.

[0] https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._R....

My understanding is that the contract (TOS) portion is not decided. This decision stated that LinkedIn does not have a protected property interest in the profiles, so it cannot claim copyright there. It's possible they could claim things like compilation copyright; that is as yet undecided. Also, I believe the appeals court only dealt with the CFAA issue; there's still the contract (TOS) to consider, as well as a possible trespass claim.

Now, the CFAA was the only criminal statute involved, so I guess that supports what you said, that scraping is not unlawful. There still may be liability though, and using the data only internally would not necessarily protect from that. It remains to be seen.

"it can be scraped/crawled, that is, it isn’t LinkedIn property"

I thought it was pretty established that putting something on a website didn't eliminate your copyright. Has that changed now?

To me, it seems like common sense would be that if you make a public website, you are implicitly permitting some copies, but surely it's not all or nothing?

Facts and tables are not copyrightable. The phone numbers in a phone book are not copyrightable, merely their presentation order[0]. If you were to copy, say, the LinkedIn website, or the LinkedIn branding, or the name LinkedIn, or any of their ads, those would be eligible, but the simple collection of names, emails, and phone numbers is ineligible for copyright.

0: https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._R....

This depends on jurisdiction though. In the European Union specifically there exists sui generis legislation that grants certain rights to the assembler of a database [1]. However, it’s a more interesting situation when the database keeper just provides a means for individuals to fill in their own data.

[1] https://en.wikipedia.org/wiki/Database_right

> I thought it was pretty established that putting something on a website didn't eliminate your copyright. Has that changed now?

No, if anything, that supports the decision.

To the extent that the material is copyrightable, it belongs to the users, who have chosen to make it public; copying incidental and necessary to that access is allowed under an implied license doctrine. Microsoft's efforts to restrict access had nothing to do with copyright; they relied on the ToS.

Perhaps it depends on intent. Clearly, the creators of the content, and those who posted the content, did so with the sole intention of making it public and usable outside the LinkedIn system. Their posting of it on LinkedIn is incidental; what site is used or who owns it is largely irrelevant to them, whereas such things clearly do matter to any company or person creating and posting their own unique content to their own site.
At one point, to fight scraping, Craigslist changed their terms so that users assigned them copyright of listings rather than just a license. It didn't work well for them, but it's an interesting approach.

https://www.eff.org/deeplinks/2013/04/craigslist-owns-what-y...

My understanding is that Facebook uses similar clauses to disallow web scraping. Does that mean Facebook is fair game too?
I'm pretty sure you would get a big GDPR fine if you started taking data that people agreed to put on LinkedIn without their express permission.