Taking action against scraping for hire | HN Mirror


	Taking action against scraping for hire (about.fb.com)
	220 points by pawelkobojek 1448 days ago

42 comments

iandanforth 1448 days ago

Collecting the rhetorical BS:

"scraping attacks"

Scraping is not an attack. Monopolists want to pretend they own your data because they get unlimited access to monetize it whereas competitors should have none.

"self-compromised"

Monopolists want to sell you thus it's imperative they maintain the fiction of "one person, one account". By admitting you own your account, they'd have to allow sharing and they wouldn't be able to provide their customers (advertisers) with reliable data about individuals.

"protect people from scraping"

Monopolists will protect themselves and call it protecting you. They will attempt to make you afraid of some other actor using your data in harmful ways so as to detract from how they monetize you and use your data in harmful ways.

"deter the abuse"

Monopolists don't want to argue about what constitutes abuse. Anything they write in their TOS is entirely for their benefit and only constrained by local law (if that). They will abuse you to the fullest extent they can get away with while arguing that any action to use your rights is "abuse."

"safeguard people against clone sites"

Monopolists want to maintain their monopoly, and there is no greater threat to it than a direct challenge from data being allowed to move freely.

--

More subtle but even more ironic rhetorical points

"for hire" / "paying for access"

Emphasizing that people making money (gasp) by providing this service is bad.

"industry leader in taking legal action" + "across many platforms and national boundaries, also requires a collective effort from platforms, policymakers and civil society"

Monopolists can pay high priced marketers to rebrand them as patriotic hero figures fighting valiantly for the little guy.

pr0zac 1447 days ago

While I agree with your assessment of the BS in the article wrt scraping, and also agree with your assessment that the behaviour is completely about FB protecting itself and its monopoly control (the word control being important), I think it's important to emphasize it's not about FB caring whether other entities have access to the data; it's about FB caring about its public perception with regard to its having that data at all.

Over the last few years or so it feels like, to reference a @dril tweet[1], Facebook has just been 'turning a big dial taht says "data access" on it and constantly looking back at the audience for approval like a contestant on the price is right' with how much it allows 3rd parties to get at its data.

Keep in mind ~5 years ago the big thing at FB was "Open Graph" and "Graph Search" which gave everyone really in-depth access to their data with the idea that Facebook would be the "data platform" on top of which all of these 3rd parties would build apps and interfaces. This of course eventually resulted in the whole Cambridge Analytica thing and now this gigantic swing in the other direction of being overly protective of the data as a kneejerk PR reaction to all the bad press.

FB loved sharing data and provided a direct API for accessing it when the public narrative was about data freedom and 3rd party developer friendliness, and it hates giving any access at all and goes around suing web scrapers now that the public narrative is all about privacy.

Facebook will happily align itself in whatever way results in the least public outcry arguing they shouldn't be allowed to have the data in the first place regardless of if that means giving access or restricting it.

1: https://twitter.com/dril/status/841892608788041732

Mo3 1447 days ago

The example you stated is a truly fantastic one. Graph Search was pretty much like a direct API into their front facing network.

nathanaldensr 1448 days ago

Great post that summarizes exactly what I feel about globocorps. The euphemisms and propaganda are disgusting.

noslenwerdna 1447 days ago

The users agreed to share their data with Facebook, not some other company. If they didn't prevent this, they'd be asking for another Cambridge Analytica.

stickfigure 1447 days ago

The users agreed to share their data with everyone that uses Instagram. Because that's how the site works.

kube-system 1447 days ago

There’s an important difference between technically consenting and informed consent.

Given what I know about the bot problem on Instagram, I would imagine many people have been tricked into sharing their private profiles with scraping bots. Many bots are copying real people’s profiles and then spamming their friends with follow requests. It’s highly effective and gives these bots access to private profiles.

Fooling people is fraudulent, period.

greatgib 1447 days ago

The user agreed on Facebook to make his data "public", so he can't complain that a robot scrapes it.

Nothing prevents him from restricting access to his pages and data to "trusted" friends.

kube-system 1447 days ago

The description in the article sounds like it scrapes private profile data.

> Octopus designed the software to scrape data accessible to the user when logged into their accounts

Kwpolska 1447 days ago

Were they showing the private data to everyone, or just to the person whose account was used for the scraping? If it’s the latter, then this is also not a crime, it is just someone accessing data they have been authorized to access, but in an automated way.

greatgib 1447 days ago

I don't think so, it is more like you scrape what is accessible to this user. So in the end you will scrape your friends' data. This is why I said that you are free to only share with friends that 'you trust'.

jasfi 1447 days ago

That is a very good point, but surely it was taken into consideration when scraping was declared legal?

stefan_ 1447 days ago

All that case says is "scraping is not a violation of the CFAA". But of course the scraped data still exists in legal limbo; maybe you can compute derived information from it, but the moment a scraper reproduces it there is all of copyright law waiting for them.

jasfi 1447 days ago

In that case, the user owns the copyright, not the company, as the user is the author. So it would be up to them to take legal action if deemed necessary.

danuker 1447 days ago

https://techcrunch.com/2022/04/18/web-scraping-legal-court/

utahcon 1447 days ago

The only argument I have here (sadly in favor of FB) is with "safeguard people against clone sites". While I did give my data to FB, I didn't approve that transfer to another site/system. That is the only place I could possibly see some legal foothold.

asdff 1447 days ago

What happens when FB builds a shadow instagram profile of you based on your FB account? That already happens. FB clones their own data for other projects no different than what you might fear happening if this data were cloned to a third party. The cat is out of the bag already but FB wants to pretend they are the only ones with the right to abuse.

kbenson 1447 days ago

It's impossible to control information once it's been created. The longer it has existed and the more places it can be seen, the more likely (exponentially so) it is to spread.

Whether we make that spread of information legal or not does little to affect whether it happens.

There are two things that might help. First, don't share as much information. Once it's no longer limited to you or your close group of friends, who hopefully won't share it along with your name, it's mostly out of your control. Second, put limits (laws) on what information companies are able to synthesize about you, and how long they can retain it. If there's less information created about you (or it's ephemeral, created and destroyed as needed), and if they need to clean out older data, there's less to be shared or stolen.

kube-system 1447 days ago

“It’s hard to enforce the rule of law” is not a good reason to abandon it entirely. Data privacy laws make data privacy better even without being 100% infallible.

We should be both practicing good data hygiene and using legal tools to combat those who abuse data privacy.

kbenson 1447 days ago

> “It’s hard to enforce the rule of law” is not a good reason to abandon it entirely.

I didn't?

> We should be both practicing good data hygiene and using legal tools to combat those who abuse data privacy.

That's what I said. The first thing is data hygiene, the second is legal requirements. The difference I think is that the legal requirements should be on the actual creation and retention of the data, not just who owns it, who it can be shared with, etc.

As soon as PII over a certain age is radioactive and linked to a fine per person, all of a sudden there'll be far fewer giant repositories of PII to worry about.

mylons 1447 days ago

they also toss in the chinese affiliation in hopes to bring even more ill will from the reader towards the company. china is probably doing some bad things, but scraping facebook ain’t one of them.

kube-system 1447 days ago

Scraping social media is something that China is very notorious for doing. They are 100% positively scraping all major social networks around the world.

They do this to collect information of foreign policy interest to them, to silence political dissidents abroad, etc.

For example: https://www.washingtonpost.com/national-security/china-harve...

And: https://www.propublica.org/article/even-on-us-campuses-china...

iandanforth 1447 days ago

Good point, I missed that one.

SergeAx 1447 days ago

I don't get the thing about "monopoly".

Let's start with one thing: copyright on databases. Take IMDb: they collect and combine totally open data on movie cast, crew, soundtracks used and so on. Anyone can go to the cinema, wait until the movie ends, write down data from the credits roll and put it in a database. There's no prohibition on this activity. A cinema may prohibit filming inside, but not using pencil on paper. Or you may buy a DVD released later, and do just the same. Or you may even write the movie company an email asking for the data in electronic form, and chances are they will send it to you or point to some promo materials website where it is already published.

But the entire database is a product of work, and that makes it valuable. So the company or organization spent time and money collecting, indexing and cross-linking the data, and has a right to bank on that work. Simply copying that database for commercial purposes _is_ stealing. This is why we have database copyright laws.

Now back to Meta. They created this product and made it attractive enough that people add their data voluntarily. Every single piece of data is quite open (maybe not really so for personal bits like face photos, emails and phone numbers). Meta spent a lot of cash making and keeping the product that attractive, and now banks on the collected data by targeting ads.

Nothing in the world prohibits anyone else from creating a service, making it valuable, attracting people, collecting data (according to data collection laws) and banking on that. But just copying data collected by Meta is stealing, and Meta is within its rights to protect it. The fact that Meta did it before doesn't make it a monopolist. In fact, there are lots of companies doing the same, like Google, Amazon, Apple, eBay etc. So in my opinion it is not a monopoly defending its position, but rather a business defending its assets from theft.

rmbyrro 1447 days ago

Missed this one:

> a US subsidiary of a "Chinese national" "high-tech" enterprise

Replacing it with "a business" would do just fine.

TechBro8615 1447 days ago

Indeed. It's the height of hypocrisy for a company to define the borders of its own system and then prosecute those who they consider in violation of them. There is no consideration given to whether the data should have been collected and retained by Facebook in the first place, regardless of whatever arbitrary access policies they defined to fit their own business and data model.

It's not clear what Facebook's position on scraping truly is. Sometimes they downplay it as "normalized and widespread," and other times they castigate it as inexplicably legal and clearly immoral, or even outright "in violation of state and federal law." For example:

- April 2021. Researchers find an exposed database containing the scraped data of 533 million facebook users. Some news reports refer to it as a "breach." Facebook attempts to downplay the issue as the result of third party scraping. Headline in ZDNet: "Internal Facebook email reveals intent to frame data scraping as ‘normalized, broad industry issue’" [0]

- October 2020. Facebook announces lawsuits against companies it claimed created a "malicious extension on Google’s Chrome Web Store designed to scrape Facebook, in violation of Facebook’s Terms and Policies and state and federal law." [1]

So... which is it? Does Facebook believe that scraping is a "broad, normalized industry issue?" Or is it a violation of "state and federal law?" It seems like they measure severity of its impact primarily based on the reactions of political commentators.

And what's the difference between automating a browser and automating an API client? Why did Facebook design an API for accessing the data they collected, if it's illegal to collect? They've even claimed to be the victim of Cambridge Analytica, who purchased a "quiz" application created by a developer who pieced it together using code straight from the "examples" section of Facebook's API documentation.

There is one obvious resolution to this apparent contradiction. If we remove Facebook from the question, then the contradiction resolves itself. All we need to do is stop presuming that Facebook has the right to collect and retain this data in the first place. And as a user, if you publish your data to a website designed for sharing it with other people, then by definition it is no longer private data. Therein lies the central question: what is "semi-private" data, and who controls its boundaries?

[0] https://www.zdnet.com/article/facebook-internal-email-reveal...

[1] https://about.fb.com/news/2020/10/taking-legal-action-agains...

p.s. another thing they never mention is why companies want to scrape lists of facebook users. perhaps it might have something to do with the "lookalike audience" feature, and its more precisely targetable predecessors, which allow advertisers to upload a list of usernames and email addresses for targeted advertising?

fxtentacle 1448 days ago

Of course, Facebook wants to make it sound like scraping is illegal, when it generally isn't.

But account hijacking and mass-creation of accounts just to access private pages are clear violations of the Facebook and Instagram ToS, so they surely can sue for that.

Raed667 1448 days ago

Violation of ToS does not mean a violation of the law.

closewith 1448 days ago

Most lawsuits aren't due to breaches of the law, but breaches of contract. Whether terms of service constitute an enforceable contract is another matter.

adamsmith143 1447 days ago

ToS have been around for decades, surely this question is settled by now?

marlowe221 1447 days ago

Former attorney turned software developer here!

Nope, it's not a settled question in the way that I think you mean. Each ToS is different so each would be subject to individual legal analysis in court on its own terms.

Questions would include whether the ToS is unconscionable, whether the terms violate laws of the locality/nation, and so forth.

It's the same with traditional contracts - the fact that contracts have been around for hundreds (maybe thousands) of years doesn't mean much if you and I create a brand new one between us. Our contract's specific terms (and events/actions between us as a result) would be the issue in court.

kaivi 1447 days ago

Why can't FB simply include a clause like "No kind of automated scraping is allowed, except for search engines in robots.txt"? This would save them so much time in court, arguing over the use of fake accounts which should really be irrelevant.

adamsmith143 1447 days ago

So even the general question of "Whether terms of service constitute an enforceable contract" depends on each individual ToS?

jhoelzel 1448 days ago

if a bot creates the account, who breaches the contract?

sneak 1447 days ago

The person who ran the bot. Programs do not have agency, they are just tools.

That's like saying "If the gun fires the bullet, who is liable for murder?" It's a silly question.

CSMastermind 1447 days ago

> That's like saying "If the gun fires the bullet, who is liable for murder?" It's a silly question.

I don't know, I've seen several people unironically argue that it should be the gun's manufacturer.

stonemetal12 1448 days ago

That is why they are suing rather than pressing charges. When someone steals your car, you don't sue them; you press charges. When someone doesn't uphold their end of a contract, you don't press charges; you sue for breach of contract.

compsciphd 1448 days ago

in reality, you as an individual can't press charges. Only the state can. And many times the state chooses not to. You can sue in civil court, but individuals can't bring cases in criminal court.

onionisafruit 1447 days ago

You are confusing pressing charges and indictment. Pressing charges just means you accuse somebody of a crime and “press” the prosecutor to indict them. So the state does have the ultimate say on who is prosecuted, but that doesn’t mean you can’t press charges.

closewith 1448 days ago

Many countries do have the concept of private criminal prosecutions.

sneak 1447 days ago

"pressing charges" isn't a thing.

stonemetal12 1447 days ago

As far as I am aware it isn't a specific thing, but a general catchall term for going through the process of filing a criminal complaint, and seeing it through to completion. Maybe there are better words for it, but "pressing charges" is what they use on TV, so it is top of mind.

In general I meant there is a difference between criminal and civil law, and suing generally refers to civil not criminal law.

onionisafruit 1447 days ago

It is a thing. In America pressing charges is when you accuse somebody of a crime and ask a prosecutor to bring criminal charges against them.

sneak 1447 days ago

Prosecutors exclusively decide who is charged. No charges can be "pressed" by a victim.

CoastalCoder 1448 days ago

I don't think I know the answer, but I'm curious:

Does violating a website's TOS mean you're accessing it beyond your authority, making it a violation of the US's Computer Fraud and Abuse Act?

tumult 1447 days ago

Not a violation. Decided by Supreme Court in 2021. Van Buren vs. United States. It was a big deal.

zja 1448 days ago

Violating TOS no; Gaining access beyond your authority maybe https://www.eff.org/deeplinks/2010/07/court-violating-terms-...

CoastalCoder 1447 days ago

I was assuming that in this case, a person's authority was specifically granted by the ToS.

I wondered if the interplay of those two concepts muddied the waters.

danaris 1448 days ago

I don't have a source for this, but my recollection is that this has been successfully argued by a couple of companies—but then an appeals court found very firmly that it was not the case.

Essentially, having that be true would mean that any given website could create whole new classes of criminal behavior.

zinekeller 1448 days ago

> having that be true would mean that any given website could create whole new classes of criminal behavior.

While this is true, reading the lawsuit it is clear that Meta is suing in civil court, so maybe they're trying to enforce their contract, especially their automated collection ToS (https://www.facebook.com/apps/site_scraping_tos_terms.php)?

dementiapatien 1448 days ago

Since when do you get sued for breaching TOS?

curiousllama 1448 days ago

Since you start a business on the violation.

"Since when do I get sued for taking too many free samples from Costco?" -> "Since you started taking millions of them to resell"

jhoelzel 1447 days ago

im not sure on american law, but if you give me those samples willingly i can do whatever i want with them.

Actually this is the reason why many products come with the label "not for resale" but i have yet to find somebody who cares about it :D

treis 1447 days ago

>give me those samples willingly

Doesn't seem like Facebook is giving them willingly.

thallium205 1448 days ago

Since when do you get sued for breaching a contract? When the offense is worth it.

golemotron 1448 days ago

You can get sued for anything that causes harm.

Relevant life lesson: don't do things to people with money that they might perceive as harm.

Corollary: Being sued is as much punishment as losing a suit for most people.

contravariant 1448 days ago

I don't know but it's at least been that way since Aaron Swartz did it I suppose.

HeckFeck 1448 days ago

Data harvesting is moral for me, but not for thee.

mateuszbuda 1448 days ago

In general I agree that harvesting public data is moral. I think that in these particular cases it's: 1) extracting data from profiles that opted for not being public (only available to logged in users) and 2) reposting scraped data (publicly?) as belonging to the guy who scraped it without users' consent.

kordlessagain 1448 days ago

Facebook has hidden much of Instagram's content behind logins, so that makes most of it "not public".

At the same time, I don't think all of Instagram's users care if their images are hidden, or not.

It's quite unfortunate Facebook/Meta is using hostile language and the word "scraping" together in this case. Scraping is a legitimate process used by various business models to gather information from the Web, which itself was originally intended to be an open forum for people to share content.

Hostile business models have corrupted that intent and turned it into a competitive environment that is harming users and legitimate models which may not have the funding larger corporations can muster.

I have a "scraper" I've built that will either snapshot a page from a user's browser or crawl it remotely with Selenium/Firefox, on the user's behalf, to save the content in an index for searching later, by that user. It's not automated, nor does it parse and crawl URLs in the pages saved. It doesn't use page content in a wider context, either.

I've spent a significant amount of time trying to "work around" anti-scraping efforts by various companies and it's frustrating to see hostility instead of cooperation in certain types of use.

car_analogy 1447 days ago

> Facebook has hidden much of Instagram's content behind logins, so that makes most of it "not public".

1) It was public when the content was posted by its authors. Facebook locked it down retroactively, regardless of the author's intent.

2) A login requirement doesn't make it non-public, if making an account is trivial, and there are already hundreds of millions of accounts. Is the plot of Avengers: Endgame also not public, because it's locked behind a ticket purchase or subscription?

wraptile 1447 days ago

Also, the login requirement is not certain. E.g. Google doesn't need to log in to index those pages, and neither do you for the first few profiles. Only after your identity (IP or fingerprint) is known does Instagram start locking public content behind login gates.

Alex3917 1448 days ago

> extracting data from profiles that opted for not being public

The tool lets you download the contact info of your friends, which you should be able to do anyway. In fact Facebook tries to trick its users into thinking they can do this with their data takeout option, but the downloaded files don't actually include any of the contact info for your contacts. Which makes zero sense, considering the entire point of Facebook is that it's a digital rolodex for storing your friends' contact info.

slightwinder 1448 days ago

From the article, it seems to be a service for scraping data you have access to anyway. As long as they only hand that data to the requesting customer, whose login they used, I don't see a difference between the general public and this user's personalized "public". If access is still limited to the people who have the access rights, then I don't see a difference between accessing through the official interface and accessing via scraped data.

saddlerustle 1448 days ago

Users make information available on facebook with the expectation that they are able to later control access to it (other than the obvious threat model of screenshotting, etc). This is violating that expectation and thus their privacy.

falcolas 1448 days ago

> they are able to later control access to it

This has never realistically been the case. An illusion of control is provided by facebook, but they've never really put much effort into it. For a really simple example, look at how long content remained available to the entire internet after "deletion". Sometimes it took years.

Expecting any semblance of privacy from a company who profits from using and selling your data is, if I'm being blunt, lunacy.

gfodor 1448 days ago

This is a false expectation and it’s important people learn this.

IfOnlyYouKnew 1447 days ago

They’ll stop posting in the way they currently enjoy and will, therefore, have lost some freedom. Great outcome!

In other news: your partner may also leak your most intimate secrets. I hope they do, to teach you a lesson?

Every trust can be betrayed. Why do you believe a world without trust would be better? Only because you cannot handle the nuance of different levels of trust?

Nextgrid 1447 days ago

There's no evidence of the accused scraper sharing the scraped data with anyone but the account-holder, so the privacy of their friends is still protected.

adolph 1448 days ago

The states of "opted for not being public" and 'available to any system authenticated person' seem contradictory.

I appreciate that 'system authenticated person' is a smaller set than those who can access anything publicly accessible, and that the former is a subset of the latter.

lolinder 1448 days ago

I agree with the moral argument against posting the scraped data publicly, but if someone gave my account access to their data, I don't think they have a moral right to say I can't use a script to do something private with it.

Scripts are tools, and like any tool they're extensions of the self. If it's morally okay to do it by hand, it's morally okay to do it with a script, so long as my script is respectful of server resources.
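
As a rough illustration of what "respectful of server resources" could mean in practice, here is a minimal sketch of a throttle that enforces a minimum gap between requests. The `Throttle` class, its parameters, and the two-second interval are hypothetical choices for this example, not anything the scrapers in the article are known to use.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, None before the first

    def wait(self):
        """Block until min_interval has passed since the last call; return the delay."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            if elapsed < self.min_interval:
                delay = self.min_interval - elapsed
                self._sleep(delay)
        self._last = self._clock()
        return delay


if __name__ == "__main__":
    throttle = Throttle(min_interval=2.0)  # hypothetical pace: one request per 2 s
    for page in ["page-1", "page-2", "page-3"]:
        waited = throttle.wait()
        # A real script would fetch the page here; this sketch only shows the pacing.
        print(f"{page}: waited {waited:.1f}s before requesting")
```

A script would call `throttle.wait()` before each fetch, which keeps it at roughly the pace a person clicking by hand would set anyway.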

upupandup 1447 days ago

Instagram behind a login screen is public. If, say, you were an OnlyFans model and somebody paid for your videos and scraped them, then there would've been an implicit agreement.

Sharing photos on Instagram, there is no such understanding, news outlets have been logging in to view and publish your instagram photos so.

trasz 1448 days ago

If they are being harvested it makes them public by definition. Unless there was a break-in.

bko 1448 days ago

It's their platform. Do you really want some random companies scraping your facebook and instagram posts?

logifail 1448 days ago

> Do you really want some random companies scraping your facebook and instagram posts?

Thought experiment: if you want to keep control over your data, try something radical: don't hand it to Meta/FB/IG at all

(Full disclosure, I'm neither on FB nor IG)

iandanforth 1448 days ago

Yes. I want a free and open web.

xvector 1448 days ago

Good for you. Normal people do not want posts shared privately amongst friends to become publicly available.

falcolas 1448 days ago

Then why would you ever put it on a website that generates its revenue from using and selling your data?

nathanaldensr 1448 days ago

Because you (not you, but people in general) are dumb and overly trusting.

blantonl 1448 days ago

Because you agreed to do so under the terms and conditions of that website.

Nextgrid 1447 days ago

There's no evidence the scraper companies mentioned there are making the scraped data public or sharing it with anyone beyond the individual customer that is already entitled to access that data through the official clients.

orangecat 1447 days ago

Then you need to trust your friends, because copy/paste and screenshots exist.

ceejayoz 1448 days ago

I'd rather anyone than "just Facebook".

"Just Facebook" has made the web shittier; entire realms of essentially public, often great content hidden behind a login wall.

trasz 1448 days ago

It’s not “your Facebook”, it’s Facebook’s Facebook. You already made that data public, otherwise it would be impossible to scrape it.

ogurechny 1447 days ago

As others said, there is no “you” in the scheme. It's Facebook's data. When people access that data without paying, they are “bad guys”. When the very same people pay for it, they are “legal partners”. In both cases they can do anything with it, while Facebook can't be held responsible because of all the official agreements. So as long as there is no specifically bad publicity or money loss anything goes either way.

“You” only exist in numerous empty statements about “privacy”, “respect”, etc. If you are feeling artsy, you can make that hyped NFT thing out of those, and see whether those kilobytes of text are really worth anything.

lbriner 1447 days ago

What you are claiming here is not true in Europe. If FB holds data about you, you still have legal rights over that data. You can have it deleted, or changed if it is somehow untrue, and you have various other rights too.

There is a relationship involved because ultimately as a FB user, if I don't like what they are doing, I can ask them to remove my data permanently and they must legally do that. If someone has "scraped" that data (if it is considered PID), without my permission or a legal basis to do so, they are in breach of the GDPR and can have enforcement taken against them.

I think some of these "aggregation" businesses will fall foul of this in Europe but I don't know what will realistically happen if that business does not exist in Europe and breaches the GDPR.

ogurechny 1447 days ago

This is how it works in press releases. The problem is that data protection laws were in fact lobbied by corporations either openly or behind the scenes, and focus on things like real names and passport numbers that look impressive but aren't really important for the data market. These are just put into some high security database (e.g. for billing info), and it's fine. However, the real behavioral data that costs money is shared as easily as it ever was in the form of “User ID <long number> was at the location of Wi-Fi AP ID <another long number>”. It doesn't matter that the data owner still trades all the history of activity of a certain individual, or that Wi-Fi station locations can be matched with some external database. Everything is fine as long as you don't slap someone's real name on that. And, contrary to the show social networks make, they couldn't care less about real names. Even if you trick the system by calling yourself John Doe, you still look at the specific content, and have specific contacts, you are you, and the data is the same.

I remember that about a decade ago some IT guys paid for the common Facebook advertiser access, then targeted the ad campaigns using filters in such a way that their intersection only resulted in a single user, or just a couple of them, and were able to match those “anonymized” accounts to real ones. You didn't have to be a genius to do that. Facebook certainly knew it could be used like that. Everyone who made money on that simply agreed to use “anonymization” as a smokescreen. Later, with all the scandals, those routine operations were presented as something exceptional done by a small number of bad actors.
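
To make the filter-intersection idea concrete, here is a toy sketch of how stacking a few ordinary targeting filters can shrink an "anonymized" audience down to one person. Everything in it is made up for illustration: the `USERS` table, the attribute names, and the `audience` and `narrowing` helpers are hypothetical and do not reflect any real advertising API.

```python
# Hypothetical, made-up audience records keyed by an opaque ID.
USERS = {
    "id-1001": {"city": "Austin", "employer": "Acme", "interest": "falconry"},
    "id-1002": {"city": "Austin", "employer": "Acme", "interest": "chess"},
    "id-1003": {"city": "Austin", "employer": "Initech", "interest": "falconry"},
    "id-1004": {"city": "Boston", "employer": "Acme", "interest": "falconry"},
}


def audience(users, filters):
    """Return the IDs whose attributes match every filter (an intersection)."""
    return {
        uid
        for uid, attrs in users.items()
        if all(attrs.get(key) == value for key, value in filters.items())
    }


def narrowing(users, filters):
    """Apply the filters one at a time and record how the audience shrinks."""
    applied, sizes = {}, []
    for key, value in filters.items():
        applied[key] = value
        sizes.append((key, len(audience(users, applied))))
    return sizes


if __name__ == "__main__":
    campaign = {"city": "Austin", "employer": "Acme", "interest": "falconry"}
    for key, size in narrowing(USERS, campaign):
        print(f"after filtering on {key}: {size} user(s)")
    print("remaining audience:", audience(USERS, campaign))
```

No name appears anywhere in the table, yet three unremarkable filters leave a single opaque ID, and anyone who already knows a falconry fan at Acme in Austin knows who that ID is.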

Nextgrid 1447 days ago

> breaches the GDPR.

Facebook breaches the GDPR all the time and manages to stay in business. GDPR enforcement is barely existent, and when it does happen, it's insufficient.

vorpalhex 1447 days ago

You published them for the world to see... so yes, presumably.

rustdeveloper 1448 days ago

“This industry makes scraping available to individuals and companies that otherwise would not have the capabilities.” - seems like web scraping companies are doing a good job :)

jhoelzel 1447 days ago

The phone charger makes energy available to individuals and companies that otherwise would not have the capabilities. ;)

theincredulousk 1447 days ago

Maybe some irony here as IIRC Facebook started as essentially a scraping company, pulling student profiles from college websites and re-publishing them for their own profit.

The scrapers have become the scrapees. The horror.

PhilipA 1448 days ago

>Octopus, a US subsidiary of a Chinese national high-tech enterprise, built a cloud-based platform designed to provide paying customers access to on-demand scraping software and services.

It is interesting how they try to position this as a Chinese attack on them.

upupandup 1447 days ago

It must coincide with Christopher Wray's sudden claim that there is an active dragnet of sorts trying to subvert America from within, much like the recent election interference involving a former Tiananmen Square activist who tried to run for Congress, I think.

It makes me think that there are many people on CCP's dole, rich powerful famous people are somehow beholden to the CCP in some unknown way but we can all guess correctly that they are all old white men who have previously been seen with young females.

MangoCoffee 1448 days ago

It looks like Zuck is giving up on the Chinese market.

romanovcode 1447 days ago

I guess after Winnie the Pooh refused to name his children for him, he has sour grapes about China.

throwaway_meta 1447 days ago

People that are criticizing this probably were also critical of the Cambridge Analytica scandal, but it would be useful to compare what happened there and here.

With Cambridge Analytica:

- Facebook allowed users (with informed consent) to allow external developers to access their data and limited data about their friends, in order to build social-enabled apps.

- CA exploited this to scrape basic profile data from a large number of users. It broke the ToS by doing so (in particular by using the data for purposes different than stated)

Here the same is happening:

- people are giving a third-party company access to their profile, which includes access to friends' data (in fact a lot more than what the app platform allowed)

- the company is scraping all the data.

At the time of CA, the criticism was that Facebook didn't do enough to enforce its ToS (or maybe that the data sharing should not have been allowed in the first place? But the terms were common knowledge and the attack potential became clear only in hindsight); here people are criticizing Facebook for in fact enforcing its ToS.

Also note that strong enforcement against scraping is one of the mandates that came from the FTC settlement.

It seems inevitable that any news about Facebook/Meta is read in the worst possible light these days, even when the criticism is self-contradictory. I would expect less superficial commentary from HN.

unosama 1447 days ago

The real reason most people were upset about Cambridge Analytica was that it revealed to the public how advertising and PR companies manipulate us. The fact that they violated Facebook's ToS was more so the excuse for the press to cover it when they wanted to write another anti-Trump piece. If you were accusing a specific newspaper of hypocrisy based on two articles I might agree. But you're referring to general public sentiment, and I really don't think most people cared or were surprised about the data collection. The shock and scandal was the realization that targeted advertising campaigns and information bubbles have the potential to sway elections.

throwaway_meta 1447 days ago

I'm referring to the HN crowd, I'm not sure that can be equated to "general public sentiment".

I agree with your first paragraph, and my point is that it is not possible to argue at the same time that Facebook should share data more broadly and allow scraping, and at the same time be critical that Facebook allowed CA to happen in the first place.

If the CA scandal was a wake-up call, it appears it was not internalized enough for people to understand the implications of what they're suggesting in this thread?

carride 1448 days ago

In the early days of FB, they convinced people that pages (or some content, sorry I do not know the FB terms) could be public for anyone to view without needing to log in to FB. This was very helpful for small businesses and communities. In many countries this is still the quickest place to make a public page. Though now, every small business or community page I want to visit is locked out unless I log in to FB. Even if I do log in, it is impossible to copy and paste the important details of a page or post, plus the UI is as ugly as it has always been.

carride 1447 days ago

I am currently in the USA and when I visit a public FB page e.g. [1], there is a small login header, and a very big annoying footer login. I estimate 15% of the content is blocked. I had spent the past year outside the USA until one month ago. When I visited the same sites while traveling outside the USA, the annoying login footer moved to the middle of the page, blocking almost all content. I do not have proof at the moment, but that was my experience trying to read 95% of government, business, and community pages, which are almost all on FB.

  [1] https://www.facebook.com/ParquesNacionalesdeArgentina

htrp 1448 days ago

This is different from LinkedIn v HiQ because HiQ was only scraping publicly available data that was generally accessible to the broader internet. In these two cases, the data is being scraped from FB/Insta using credentials that the client handed over or the mass creation of accounts solely for scraping purposes.

Nextgrid 1447 days ago

> the mass creation of accounts solely for scraping purposes.

Those accounts wouldn't be allowed to view private data though unless they friend/follow the person first, so they'll still be limited to data the account holders intend to be public and available to anyone.

There's also no evidence that the scraped data was aggregated at scale or commingled in any way, so even if customers provided their actual credentials which grant them access to private data of their friends, the scraper didn't share it with anyone else but them.

squaresmile 1448 days ago

Yeah, I think this is more like the Cambridge Analytica situation.

benwad 1448 days ago

Did FB ever take any legal action against Cambridge Analytica? I can't remember anything about it and this sounds very similar to that (although back in those days FB's tools made this incredibly easy).

lesuorac 1447 days ago

No. FB's ToS at the time [1] allowed CA to do what they did.

Namely, CA didn't resell the data or give it to an ad agency.

[1]: https://web.archive.org/web/20180329131546/https://developer...

Nextgrid 1447 days ago

I wish the Cambridge Analytica FUD would stop. CA's "attack" was to set up a malicious website that convinced idiots to give it access to their Facebook account using the standard OAuth2 flow.

Did they misuse the collected data? Sure. But people granted access to that data knowingly. This wasn't really an attack in my view.

Facebook wasn’t really complicit and definitely didn’t sell/give away any data.

postalrat 1447 days ago

What would be your position if the data being scraped is data the site selectively provides to Google for indexing but doesn't provide publicly?

i_have_an_idea 1448 days ago

> After paying for access to the scraping software, customers self-compromised their Facebook and Instagram accounts by providing their authentication information to Octopus

"self-compromised" lol

clearly these people just wanted an automated way to access their own data

antonf 1447 days ago

> clearly these people just wanted an automated way to access their own data

GDPR and CCPA (and probably many other national/state privacy laws) force Facebook/Instagram/etc. to let you download and/or delete your data without using third-party websites. Usually people self-compromise their accounts in exchange for money: https://www.buzzfeednews.com/article/craigsilverman/facebook...

pclmulqdq 1448 days ago

They have to keep the walls up on their garden so they can get maximum value from harvesting.

ok123456 1447 days ago

Remember back when Facebook grew their little network by scraping your Gmail contacts?

Google blocked them.

There was animus between the two companies that resulted in Facebook not making an official Android app until 2010.

pid-1 1448 days ago

> scrapping attack

mohamez 1448 days ago

That cracked me up when I read it lol

almog 1448 days ago

Ironically, around a year ago I disclosed (using their White Hat bug bounty program) that I'm able to access recruitment data (mostly candidates' details) using a very cheap form of scraping against a 3rd party service provider. They dismissed it and instructed me to report it to the 3rd party that operates that service (which I had done beforehand, but the issue had not been fixed).

Sorry for being vague here, I haven't publicly disclosed it yet, but will probably have to if it doesn't get fixed.

nicholasjarnold 1447 days ago

Funny story from the early days of TheFaceBook, probably around 2005ish:

I was a webmaster of a set of servers on a major university's network. I also had access (enough to run arbitrary programs that had pretty much full ingress/egress to the public internet) to a number of machines across the campus's network. Through some of my coursework and ACM chapter activities I met some other similarly minded technical people with similar levels of access.

We decided that it would be fun to use our superpowers (access + programming abilities + curiosity) to sign up for various accounts on FB and essentially scrape and friend as much as possible. At the time they had some rate limiting, some IP banning (which wasn't terrible because the Uni gave public IPv4 addrs to all machines on campus by default) and then added some early CAPTCHA which we ended up breaking pretty trivially with some Python and image recognition code.

Never got sued... :) Never really did much with the scripts or data except test that they worked. Fun times.

cosmiccatnap 1448 days ago

I would consider this appropriate if one of the largest offenders of scraping weren't the one pretending to be the offended.

paultopia 1448 days ago

"Scraping attacks" LOL

sophacles 1448 days ago

Why not? weev was put in jail over incrementing a number in a url. Surely writing software to put values into urls is even worse.

sneak 1447 days ago

Let's be clear and accurate: technically weev was put in jail for conspiring on IRC with JacksonBrown. JacksonBrown was the one who wrote a PHP script that incremented a value in a URL (and appended a valid Luhn check digit following incrementation).

Conspiracy to access a protected computer system - that is, typing on IRC. weev didn't write any of the code or access the API.
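For anyone unfamiliar with the Luhn part: below is a minimal sketch of the standard Luhn check-digit calculation applied to sequential numbers. It is not the script from the case (which was PHP and is not reproduced here). The function names and the starting value are made up for illustration, and nothing in it contacts any server.

```python
def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk right to left; the rightmost payload digit is doubled first,
    # then every second digit after it.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return (10 - total % 10) % 10


def sequential_ids(start: int, count: int) -> list[str]:
    """Increment a base number and append a valid check digit to each value."""
    return [f"{n}{luhn_check_digit(str(n))}" for n in range(start, start + count)]


if __name__ == "__main__":
    # 7992739871 is the usual textbook payload; its check digit is 3.
    print(sequential_ids(7992739871, 2))
```

The point is only that the check digit is public arithmetic, not a secret: anyone who can increment the number can also produce a value that passes validation.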

samsoftstuff 1448 days ago

It's like they don't know that courts just made it legal: https://techcrunch.com/2022/04/18/web-scraping-legal-court/

brushfoot 1447 days ago

From the article: "[T]he Ninth Circuit reaffirmed its original decision and found that scraping data that is publicly accessible on the internet is not a violation of the Computer Fraud and Abuse Act."

The key phrase is "publicly accessible." This wasn't that. The scraping was done by automating Facebook accounts, which have terms of service, which forbid scraping.

ToS/EULAs make a big difference. They're the reason Blizzard could shut down bnetd's StarCraft server. They're why no one can legally reverse engineer Oracle to create a drop-in replacement, despite interoperability provisions.

More and more platforms are putting the majority of your user-generated content behind auth walls with ToS because that's how they prevent competitors from swiping it.

EMIRELADERO 1447 days ago

> ToS/EULAs make a big difference. They're the reason Blizzard could shut down bnetd's StarCraft server. They're why no one can legally reverse engineer Oracle to create a drop-in replacement, despite interoperability provisions.

Strictly referencing EULAs for user-owned copies of software here, not ToS:

That is not true. The Blizzard court clearly erred in not considering unconscionability when analyzing the EULA. As for Oracle, the interoperability provisions are what override that part of the EULA.

Nextgrid 1447 days ago

Does it go into detail about the actual meaning of "publicly accessible"? Because most content on Facebook/Instagram requires any valid login (as opposed to a specific account) and is data that people intend to be public (especially on Insta).

In this case, the account requirement would be a technicality and the data, for all intents and purposes, would still be considered "publicly accessible" if anyone with an account can access it.

upupandup 1447 days ago

Data behind a login screen that any member of the public can get past isn't private information. Private info would be OnlyFans videos. So far there is no such feature on Instagram.

blantonl 1448 days ago

"Legal" doesn't make it ethical, nor does it shield you from liability if you willfully violate contract law (terms of service)

Nextgrid 1447 days ago

So much bad faith in this press release but not surprising from such a disgusting company, with of course some China-related fear-mongering despite no evidence of wrongdoing.

> After paying for access to the scraping software, customers self-compromised their Facebook and Instagram accounts by providing their authentication information to Octopus.

They didn't "self-compromise" their account. They trust Octopus to act on their behalf, and unlike Facebook, Octopus' interests are most likely more aligned with their users' since their service is paid. This is no different from handing your Facebook credentials to your social media manager or secretary. There's no evidence that Octopus misused this access in any way.

> Octopus designed the software to scrape data accessible to the user when logged into their accounts, including data about their Facebook Friends such as email address, phone number, gender and date of birth, as well as Instagram followers and engagement information such as name, user profile URL, location and number of likes and comments per post.

This is either information people intend to be public or information they trust their friends to keep private. Now if Octopus was leaking the private information to third-parties it would be one thing, but so far I see no evidence Octopus was disclosing the scraped information to anyone but their customer (who is already authorized to access it).

> Meta is an industry leader in taking legal action to protect people from scraping and exposing these types of services

Translation: Meta is an industry leader in protecting its disgusting business model that hinges on putting public data behind a walled garden with an unacceptable "privacy" policy. There wouldn't be a market for Octopus (or other scrapers) if Facebook already allowed customers to efficiently access information they're already entitled to, but that would be against their interests as their entire business hinges on information being held hostage.

They've created a problem, are selling the cure (well in this case monetizing it via ads) and are now pissed off that someone else is selling the cure for cheaper.

Litost 1447 days ago

Anyone else heard of Tim Berners-Lee's idea of hosting your data in pods outside the relevant corps wanting access to it and you controlling what's shared and how? This is such a completely different way of doing it, I'm not sure of all the implications, be that from admin (how much effort) to security (would this be a massive hacking opportunity) etc. https://www.theregister.com/2022/01/20/tim_bernerslee/

allenleee 1448 days ago

Ironically, Octopus reminds me of "Octopus VR" in the Silicon Valley show.

https://www.youtube.com/watch?v=ltFB4WBdDg4

mothsonasloth 1447 days ago

"It's a water animal"

viburnum 1448 days ago

One of Facebook’s earliest acquisitions was a scraping company called Octazen.

dangerlibrary 1448 days ago

Fingers crossed they eventually get around to suing Clearview AI out of existence.

https://www.nytimes.com/2020/01/18/technology/clearview-priv...

oxff 1448 days ago

Pretty rich idea coming from FB, lol. They do human scraping.

trasz 1448 days ago

We need to update the law to make sure Meta loses in cases like this.

jmyeet 1448 days ago

I'm torn on Web scraping because the extreme of each end of the spectrum on this issue both seem unreasonable.

On one side, you have people who say any form of scraping should be disallowed, even prosecutable. This went so far that the Department of Justice on behalf of AT&T prosecuted a case of URL modification [1]. One of the few bright spots for this psychotic Supreme Court was to curtail the government's power under the CFAA by limiting what constituted "unauthorized" access [2].

On the other hand, there are those who think that any level of scraping should be fine and I think that's untenable too. Consider Yahoo indexing of Stack Overflow [3]:

> In the meantime, since Yahoo (via Slurp!) is about 0.3% of our traffic, but insists on rudely consuming a huge chunk of our prime-time bandwidth, they’re getting IP banned and blocked.

Do these "scraping extremists" think such actions should be illegal? It's actually not that far-fetched given the Ninth Circuit decided LinkedIn wrongly blocked HiQ scraping [4]. Like if you change your website with the intent that it'll make scraping more difficult, is that a problem? What if it's an unintended side effect?

Additionally, companies like Meta, Google and Apple are going to be way more accountable for abiding by data retention laws and regulations than any scraper. If it's OK to scrape FB.com completely, that information is out there forever.

I certainly think the government shouldn't prosecute on behalf of companies. At least that should expose to people how the government's #1 priority is in fact to protect the true constituents: corporations and the capital-owning class.

[1]: https://www.techdirt.com/2013/09/30/dojs-insane-argument-aga...

[2]: https://en.wikipedia.org/wiki/Van_Buren_v._United_States

[3]: https://stackoverflow.blog/2009/06/16/the-perfect-web-spider...

[4]: https://blog.ericgoldman.org/archives/2019/09/ninth-circuit-...

ConstantVigil 1448 days ago

> So much about this case is ridiculous, and it’s complicated by the fact that nearly everyone agrees that weev is a world-class jerk. But, you need to separate that out from the details of what he did here, to note that it was nothing particularly special, and it involved the sort of thing that security researchers do all the time, and which all sorts of non-security researchers do quite often.

Yeah... uhm... I used to do exactly this sort of thing...

When I was a teenager, I would look at the URL of whatever site I was on, and would change a number here, or a letter there; and see what I got.

Sometimes you get nothing, sometimes you get something. Sometimes that something is quite interesting.

romanovcode 1447 days ago

> Meta is an industry leader in taking legal action to protect people from scraping and exposing these types of services, which provide scraping as a service across multiple websites.

Sure, as long as Meta is not the one selling the data to Cambridge Analytica it's wrong.

xvector 1448 days ago

HN is hypocritical - most commenters here are against this because "Meta bad," but at the same time, most commenters wouldn't want their posts shared privately amongst friends to be scraped and made available publicly.

oefrha 1448 days ago

> most commenters wouldn't want their posts shared privately amongst friends to be scraped and made available publicly.

Where's the "posts shared privately amongst friends made public" part? There are two cases here:

1. A service that logs in as the customer (who voluntarily provides their credentials) and scrapes information visible to said customer on their behalf. Nothing about "made available publicly" is alleged.

2. An individual using a pool of bot accounts to scrape posts visible to any logged in user. Nothing about "shared privately" is alleged. To be clear I don't like the method, but I'll also have to admit I've used one of the Instagram "clone sites" in the past thanks to their login wall.

Unless I missed something, it sounds like you just made it up.

mpeg 1448 days ago

For that to happen, one of your friends would have had to willingly allow this tool to scrape their social network, which would include your private posts.

Is the scraper to blame here, or the friend?

ogurechny 1447 days ago

Like many other people, you are calling something “private” when it is not.

“Privately shared with friends” used to mean that only you and your friends know something. You don't “share” anything with “friends” on a social network. You give the information to a giant corporation. If it finds it suitable, it then delivers it to other users, but only after it records your location, analyzes the content to check if you were, say, affected by some melodramatic event (and therefore should be tricked into spending more time… I mean, get “personal recommendations” for a certain kind of content), and does a billion other things.

If you consider that this is fine, please relay all your conversations with family and friends through me from now on. I offer secure, reliable, fast, yada yada communication service. And it's hip! Ask anyone on the street what they use.

pawelkobojek 1448 days ago

There are two cases they brought up, one being web scraping and the other being a clone website publicly displaying content from Instagram.

I think Meta might be mixing up these two cases here on purpose to make it look like web scraping is as bad as stealing photos to publish them on a clone website.

postalrat 1447 days ago

Who is scraping their private messages? Themselves or their friends?

Komodai 1448 days ago

lol maybe if you don't want that happening you shouldn't be using Facebook

throwaway5959 1448 days ago

Wasn’t Meta stealing news articles and not paying news organizations for them?

NelsonMinar 1447 days ago

Octopus sounds really useful; is there an open source equivalent? I'd love to be able to scrape my own data on Facebook. Their data export feature is fairly good but far from complete.

typon 1447 days ago

Google has turned Google Search into a walled garden by scraping people's content and serving it up on their own platter. Is anyone going to stand up to them?

dmje 1447 days ago

Or Facebook could just open up their data. Oh wait, not their data, silly me. Everyone else's data. Keep on scraping, I say.

rmbyrro 1447 days ago

The fact they're wasting time on that is a sign that Facebook's decay phase has already started.

upupandup 1447 days ago

whoa wasn't there somebody on HN who ran a web scraping shop and was boasting a while back that they could scrape Instagram? are these the same guys???

I don't know how far Facebook can get with this, I thought LinkedIn's court ruling made scraping de facto legal

jascii 1448 days ago

So, Facebook doesn't want to share the data it wants us to share with them? Figures...

postalrat 1447 days ago

Hey instagram/facebook/linkedin/etc: It's not your data.

samsoftstuff 1448 days ago

It's like they don't know that courts made it legal: https://techcrunch.com/2022/04/18/web-scraping-legal-court/

neya 1447 days ago

Evil Big Co. that literally STEALS people's personal information everywhere they go even after they've indicated they want to be left alone is now offended when someone does the same to them?

Well, color me surprised /s

Fuck Facebook. Meta. Or whatever you want to call it.

Hedepig 1448 days ago

Is this much different from LinkedIn vs hiQ?

nojito 1448 days ago

Logged in vs not logged in data.

logifail 1448 days ago

> Logged in

Is this actually private data, or is it public stuff that's become annoyingly hard to view anonymously because Meta chose to stick it behind a login box?

cupofpython 1447 days ago

>public stuff that's become annoyingly hard to view anonymously because Meta chose to stick it behind a login box

this one

nojito 1447 days ago

Anything behind a login gate is private data for that registered user only.

logifail 1447 days ago

> Anything behind a login gate is private data for that registered user only

That's quite the claim, if only the login gate were either always there or indeed always not.

Presumably such "private" data ought not to be being indexed by search engines and returned to users who search?

"site:instagram.com" is of the order of 228 million pages on google.com, and "site:facebook.com" is another 422 million.

nojito 1447 days ago

pretty sure you get hit with a login gate if you navigate to the results via site:instagram.com no?

Nextgrid 1447 days ago

Depends if another user can also access it, or whether the original author/owner of the data in question intends for it to be public. In Facebook's case, there are permission levels you can set on posts, including a "public" option (which isn't actually public though and will require a login anyway, but it can be any login) which would settle that debate quickly - hell I wouldn't be surprised if that option were to be hidden so as not to acknowledge that a particular bit of data was explicitly posted for everyone to see.

logifail 1447 days ago

> In Facebook's case, there are permission levels you can set on posts, including a "public" option (which isn't actually public though and will require a login anyway, but it can be any login)

Q: Have you tried this?

In a private browser session I started at google.com, searched for "site:facebook.com nextgrid", picked some random post, clicked through, and was reading the post without seeing anything other than FB's cookie banner. No sign of any login (which is good 'cause I don't have one)

upupandup 1447 days ago

but you make it public for everybody with the publicly accessible login, so it wouldn't be considered private data, for the same reason news outlets can use your Instagram images and share them widely without your permission.

you can't throw up a login screen but then let people post content that ends up in the public domain, because the login does not distinguish between a member of the public and a user specifically authorized to view your selfie pics.

throw20220707 1448 days ago

From a GDPR point of view this kind of 3rd party data collection is not acceptable (assuming it covers personal information, for example names of people and what they have posted). The difference with Meta's own data collection is that the users have a relationship with Meta and users have given their permission for Meta to handle the data. Users also know they can contact Meta and ask them to remove the data.

3rd parties don't have the consent from users. Users don't even have an idea these companies might be holding their data.

Nextgrid 1447 days ago

From a GDPR point of view the scraper would be acting as a data processor on behalf of their customer, no different from using a cloud storage service for your contacts. It's fine as long as the third-party doesn't misuse the scraped data or share it with third-parties and there's no evidence they did so in this case.

danuker 1447 days ago

> and there's no evidence they did so in this case.

Indeed; the users probably wanted to make the data public, if scraper accounts could see it. There is a GDPR allowance for data "manifestly made public by the data subject".

https://gdpr-info.eu/art-9-gdpr/

Here, it's just Facebook wanting to keep the data inside a walled garden.

For the same reason, I quit LinkedIn and made my own site. I don't want people to have to sign in to see my profile.

uhtred 1447 days ago

Fuck off Facebook you scumbags

Komodai 1448 days ago

Is it Octopus Data Inc. aka Octoparse they are suing?

jacooper 1448 days ago

They are still using the fb.com domain? I thought Meta is not Facebook?....

Silica6149 1448 days ago

I think it's like Google vs Alphabet. Alphabet is the parent company like Meta.

As for why their domain is facebook for their news site, not sure why. It would make more sense for it to be under meta instead.