by rdoherty 128 days ago
Skimming the list, looks like most extensions are for scraping or automating LinkedIn usage. Not surprising as there's money to be made with LinkedIn data. Scraping was a problem when I worked there, the abuse teams built some reasonably sophisticated detection & prevention, and it was a constant battle.
6 comments

In order to create the data source that LinkedIn's extension-fingerprinting relies on to work, someone (at LinkedIn*?) almost certainly violated the Chrome Web Store TOS—by (perversely*) scraping it.

* if LinkedIn didn't get it from an existing data source

Programmers don't appreciate the fact that you can just violate terms of service. You can just do it. It's okay. The police won't come after you. Usually.
I think the point is more "in order to prevent people from scraping their site, which is against their ToS, they scraped some other site, against its ToS".
Read "in order to have more money, I did things that caused other people to have less money"
When someone who sees the world through a lens of morality notices somebody operating without morality, it is startling.

And it deserves a call out! The benefits to being so cynical that you’re numb to it come with a lot of tradeoffs

Indeed. I read a lot of comments like these one you are responding on HN. It seems like there is a type of person who thinks that writing down what their rules are has some magical power.

“This isn’t what it was intended for”. Who cares?

A long long time ago in a galaxy far far away I would encounter warnings on pirating websites saying “If you are an FBI agent you are not allowed to continue on this site”. Imagine their utter disbelief and shock if they were to be arrested by an FBI agent that clicked past the warning anyway.

I agree it must be programmers as a type who like rules a lot and think what a perfect world it could be if people would follow them.

I'd ask who you think you have me confused for or where you got that quote from, but I know how little it matters insofar as getting you to recognize whatever delusion led to your comment.
I am sorry, I wasn't reacting to you; I was reacting to the commenter who said:

"Programmers don't appreciate the fact that you can just violate terms of service."

> comments like these one you are responding

That's my comment.

3000 extensions is few enough that a small team could download each extension manually over a few months. You don't need to scrape at all.
In the first place, no one said they needed to, only that they probably did.

Secondly, it's not "3000 extensions". They didn't somehow magically divine that the 2953 (+/-47) extensions we see here were the ones that they needed to download in order to be able to exploit the content-accessible resources described in their extension manifest. They looked at a much larger set, and it got filtered down to these 2953 that satisfied the necessary criteria.
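To make the "filtered down" step concrete, here is a rough sketch of what such a filter could look like. It is not the researcher's or LinkedIn's code, and the actual criteria aren't described in this thread. The names `exposedResources` and `filterProbeable` and the `{ id, manifest }` input shape are made up for illustration. It assumes the criterion is simply "the manifest lists at least one concrete file under `web_accessible_resources`", and it ignores the Manifest V3 `matches` restriction for simplicity.

```javascript
// Hypothetical sketch: the real filtering criteria are not known from the thread.
// Returns the concrete resource paths a manifest exposes to web pages, or [].
function exposedResources(manifest) {
  const war = manifest.web_accessible_resources || [];
  // Manifest V2 lists plain strings; Manifest V3 lists objects with a `resources` array.
  const paths = war.flatMap((item) =>
    typeof item === "string" ? [item] : item.resources || []
  );
  // Wildcard patterns don't name a single file to request, so skip them.
  return paths.filter((p) => !p.includes("*"));
}

// Keeps only extensions with at least one concrete exposed file,
// pairing each extension ID with the first such file.
function filterProbeable(extensions) {
  return extensions
    .map(({ id, manifest }) => ({ id, file: exposedResources(manifest)[0] }))
    .filter((entry) => entry.file !== undefined);
}
```

Run over a large crawl of manifests, something like this would leave only the extensions that can be probed at all, which is the point being made about the 2953 not being picked by hand.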

Lol no, did you even read the list? You could pay someone to just search "LinkedIn" and "talent" and "recruiting" on the chrome web store and download each extension. It's probably harder to automate this than it is to do it manually. This is something you could develop in an afternoon and pay a small team of people to do for pennies on the dollar. Even ten thousand extensions is nothing. Spread that over years and this is trivial.
For someone choosing to be so obnoxiously condescending, you are excruciatingly stupid.
a problem for linkedin != "a problem". The real problem for people is the back room data brokering linkedin and others do.
From the code, it doesn't look like they do anything if they have a match; they just save all the results to a CSV for fingerprinting?
"The code" here you're referring to (fetch_extension_names.js[1]) isn't and doesn't claim to be LinkedIn's fingerprinting code. It's a scraper that the researcher behind this repo wrote themselves in order to create the CSV of the data that they're publishing here.

LinkedIn's fingerprinting code, as the README explains, is found in fingerprint.js[2], which embeds a big JSON literal with the IDs of the extensions it probes for. (Sickeningly enough, this data starts about two-thirds of the way through the file* and isn't the culprit behind the bulk of its 2.15 MB size…)

* On line 34394; the one starting:

    const r = [{
                id: "aacbpggdjcblgnmgjgpkpddliddineni",
                file: "sidebar.html"
1. <https://github.com/mdp/linkedin-extension-fingerprinting/blo...>

2. <https://github.com/mdp/linkedin-extension-fingerprinting/blo...>
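For anyone wondering how a list of `{ id, file }` pairs turns into detection, here is a minimal sketch of the general technique. It is not LinkedIn's actual code. `probeUrl` and `detectExtensions` are hypothetical names, and `fetchImpl` is injected so the sketch can be run outside a browser. The assumption is that a request to a `chrome-extension://` URL resolves when the extension is installed and exposes that file, and rejects otherwise.

```javascript
// Hypothetical sketch, not LinkedIn's code. The entry shape ({ id, file })
// mirrors the excerpt above.
function probeUrl(entry) {
  return `chrome-extension://${entry.id}/${entry.file}`;
}

// fetchImpl is injected (assumption) so this can be exercised with a stub.
// In a page it would be the browser's own fetch.
async function detectExtensions(entries, fetchImpl) {
  const results = await Promise.all(
    entries.map(async (entry) => {
      try {
        await fetchImpl(probeUrl(entry));
        return entry.id; // request resolved: the resource exists
      } catch {
        return null; // request rejected: not installed, or not exposed to this page
      }
    })
  );
  return results.filter((id) => id !== null);
}
```

In a page you would call `detectExtensions(r, fetch)`, and the resulting ID list is what ends up feeding a fingerprint.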

Thanks, my fault for not reading the README and just doing a quick read of the code.
Looking at the list, it does not seem really “sophisticated”. It is just a list based on names (e.g. whether there is an “email” in the name). The majority of the extensions do not even ask for permission to access linkedin.com.
I had the pleasure of scraping LinkedIn for a client. Great fun.
Won't someone think of poor little LinkedIn, a subsidiary of one of the largest data brokers in the world?
Why frame what you are trying to say like that? Businesses of all sizes deserve the ability to protect their businesses from abuse.
Do they respect my data? Why do they get to track me across sites when I clearly don't want them to, but someone can't scrape their data when they don't want them to? Why should big companies get the pass but individuals not? They clearly consider internet traffic fair game and are invasive and abusive about it, so it is not only fair to be invasive and abusive back, it is self-defense at this point.
They don’t need to track your web browser when they’re owned by Microsoft, because they track every action at a lower level.
Weird, I don't use Windows as an OS but have linkedin. I'd believe the concern and disregard of Linkedin's concern is fair game.
What lower level? Does Microsoft own the internet?
The operating system. For example see the Windows 11 screenshot debacle/scandal.
“They” is an incredibly useful tool.
You do realize anti-scraping measures are one way of protecting your data too?
In this context, "protecting" means the interest of LinkedIn, which aggressively sells the data. Users who give data to LinkedIn are not protecting their data either way.
Because you signed up to a set of terms and conditions saying LinkedIn can use your data in this way.
What if I signed up before those ToS said they could use my data in this way?

Oh right, companies change ToS and EULA and "agreements" without notice, without due process, and without recourse.

I have no problem changing how I use "their" data in such situations.

> Oh right, companies change ToS and EULA and "agreements" without notice, without due process, and without recourse.

Companies change their terms of service all the time. They usually send emails about it.

I've responded to decline them a handful of times and asked for my account to be deleted. I chuckle slightly at the work it creates, but sometimes it has been easier to close an account that way.

No one likes paying taxes but they still do it. They could just not work and not have money and therefore not need to pay tax.
Except what you have to pay each year for the privilege of staying in "your" house.
I didn't want the web to turn into monolithic platforms. I abhor this status quo.

You cannot function without these enterprises, but that doesn't mean they're ideal or even ethical.

Microsoft wins because of network effects. It's impossible to compete. So I think it should be allowed to assail their monopoly here by any means. It's maximally fair for consumers and for free markets.

Ideally capitalism remains cutthroat and impossible to grow into undislodgeable titans.

Even more ideally, this would become a distributed protocol rather than a privately owned and guarded database.

That doesn't actually mean anything
I think they framed it this way because they don't consider scraping abuse (to be fair, neither do I, as long as it doesn't overload the site). Botting accounts for spam is clear abuse, however, so that's fair game.
No, I consider all data collection and scraping egregious. From that perspective, LinkedIn is hypocritical when Microsoft discloses every filesystem search I do locally to Bing.
Are you not scraping a site with your eyeballs when you view a site?
By that logic I can charge you for looking at me.
When they scrape, it’s innovation. When you scrape, it’s a felony.
I'm sure there are issues with fake accounts for scraping, but the core issue is that LinkedIn considers the data valuable. LinkedIn wants to be able to sell the data, or access to it at least, and the scrapers undermine that.

They could stop all the scraping by providing a downloadable data bundle like Wikipedia.

Thinking more about it, I don't think it's a terrible thing that they prevent scraping. Their listings are already suffering from being flooded with garbage applications and having to sift through tons of noise. Allowing scraping would just amplify that and make the platform almost entirely worthless.

I "scrape" LinkedIn in a roundabout way for personal use, and really what I've found is that I should just maybe not bother at all. I can't get through the noise even when I'm applying at places that heavily match my skillset, and just get automated rejection emails.

LLMs scrape Wikipedia all the time, or at least attempt to.

The data bundle doesn't help that at all.

That's true, the normal scraping would still happen, but it would eliminate this side business of trying to re-sell LinkedIn's data.
What is abuse? Is it anything that reduces my profit margin? Or is it anything that makes the world a worse place? The Flock CEO called Deflock terrorism, is he right?
this exchange -- obvious critical / perhaps insurrection speech versus a stable voice of business economics -- should be within the purview of an orderly and predictable legal environment. BUT things moved quickly in the phone battles. Some people say that the legal system has never caught up to the data brokering, and in fact the surveillance state grew by leaps and bounds.

So, reasonable people may disagree. This is a fine place to mention it: what if individual profiles built at LinkedIn are being combined with illegitimate and even directly illegal surveillance data and sold daily? Everyone stand up and salute when LinkedIn walks in the room? There have to be legal and direct ways to deal with change, and enforcement to complete an orderly and predictable economic marketplace.

>BUT things moved quickly in the phone battles. Some people say that the legal system has never caught up to the data brokering, and in fact the surveillance state grew by leaps and bounds.

Partially by discrepancy in how responsive you can be or comprehensive you must be to win the next round of cat-and-mouse, and partially because a private/corporate surveillance apparatus is useful to a government that might otherwise be hampered by constitutional bounds.

We enjoy the fruits of an LLM or two from time to time, derived from hoards of ill-gotten data. LinkedIn has the resources to attempt to block scraping, but even at the resource scale of LI I doubt the effort is effective.
I am not denying that scraping is useful. If it wasn't people wouldn't do it. But if the site rules say you aren't allowed to scrape, then I don't think people should be hostile towards the people enforcing the rules.
Well, they can try to enforce the rules; that's perfectly fair. At the same time, there are many methods of "trying" which I would not consider valid or acceptable ones. "Enforcing the rules" does not give a carte blanche right to snoop and do "whatever's necessary." Sony tried that with their CD rootkits and got multiple lawsuits.
> the abuse

using the information they publish to the public
Yes, until it becomes abusive and malignly affects innocents.
The big social media businesses deserve a Teddy Roosevelt character swooping in and busting their trusts, forcing them to play ball with others even if it destroys their moats. Boo hoo! Good riddance. World's tiniest violin.

This is a popular position across the aisle. Here's hoping the next guy can't be bought, or at least asks for more than a $400M tacky gold ballroom!

I mean, regardless of who they are or even if you don’t like what LinkedIn does themselves with the data people have given them, the random third parties with the extensions don’t additionally deserve to just grab all that data too, do they?
Surely they do! The data is on the public internets, isn't it?
They'd put Widevine or PlayReady DRM on the website if they could, I'm sure.
why can't they?
because they're only for video files?
I say the same thing about my start menu sending every action I perform to Bing.
Eh. I worked at a company which made an extension which scraped LinkedIn. We provided a service to recruiters, who would start a hiring process by putting candidates into our system.

The recruiters all had LinkedIn paid accounts, and could access all of this data on the web. We made a browser extension so they wouldn’t need to do any manual data entry. Recruiters loved the extension because it saved them time.

I think it was a legitimate use. We were making LinkedIn more useful to some of their actual customers (recruiters) by adding a somewhat cursed API integration via a Chrome extension. Forcing recruiters to copy and paste didn't help anyone. Our extension only grabbed content on the page the recruiter had open. It was purely read-only and scoped by the user.

Doesn't sound like your operation was particularly questionable, but I can imagine there must be some of those 3,000 extensions where the data flow isn't just "DOM -> End User" but more of a "DOM -> Cloud Server -> ??? -> Profit!" with perhaps a little detour where the end user gets some value too as a hook to justify the extension's existence.
I started there, but it felt like a dodgy way (as it could be seen to be illegal). We then just went all official and went through Google search APIs with LinkedIn as the target. Worked a treat and was cheaper than Recruiter!!!

So when you pay the highest scraper, it's OK! Same data, different manner.