by john_strinlai 112 days ago
many people tend to overlook how little information is needed for successful de-anonymization.

i like to introduce students to de-anonymization with an old paper "Robust De-anonymization of Large Sparse Datasets" published in the ancient history of 2008 (https://www.cs.cornell.edu/~shmat/shmat_oak08netflix.pdf):

"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix [...]. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."
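
To make the quoted claim concrete, here is a minimal Python sketch of the general idea. It is not the paper's exact algorithm. It scores every record against a few known (movie, rating) pairs, gives rare movies more weight, and accepts the best match only if it clearly beats the runner-up. The toy dataset, the helper names `movie_weights`, `score` and `best_match`, the rating tolerance and the margin threshold are all assumptions made up for illustration.

```python
import math
from collections import Counter

# Toy "anonymized" dataset: record id -> {movie: rating}. Entirely made up.
RECORDS = {
    "r1": {"Popular Hit": 5, "Blockbuster": 4, "Obscure Doc": 2},
    "r2": {"Popular Hit": 5, "Blockbuster": 4, "Cult Classic": 5},
    "r3": {"Popular Hit": 4, "Obscure Doc": 2, "Rare Indie": 1},
}

def movie_weights(records):
    """Rarer movies get larger weights (assumed weighting: 1 / log(1 + raters))."""
    counts = Counter(m for ratings in records.values() for m in ratings)
    return {m: 1.0 / math.log(1 + c) for m, c in counts.items()}

def score(aux, ratings, weights, tolerance=1):
    """Sum the weights of movies whose rating is within `tolerance` of the known one."""
    return sum(
        weights.get(m, 0.0)
        for m, r in aux.items()
        if m in ratings and abs(ratings[m] - r) <= tolerance
    )

def best_match(aux, records, margin=0.5):
    """Return the top-scoring record id if it beats the runner-up by `margin`, else None."""
    weights = movie_weights(records)
    ranked = sorted(
        ((score(aux, r, weights), rid) for rid, r in records.items()),
        reverse=True,
    )
    (top, rid), (second, _) = ranked[0], ranked[1]
    return rid if top - second >= margin else None

# Two ratings of unpopular movies single out one record ...
print(best_match({"Obscure Doc": 2, "Rare Indie": 1}, RECORDS))  # r3
# ... while one rating of a popular movie is ambiguous.
print(best_match({"Popular Hit": 5}, RECORDS))  # None
```

The point of the toy example is that a couple of ratings on unpopular titles carry far more identifying weight than many ratings on popular ones.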

and that was nearly 20 years ago! de-anonymization techniques have improved by leaps and bounds since then, alongside massive growth in the technology that enables them.

i think the age of (pseudo-)anonymous internet browsing will be over soon. certainly within my lifetime (and i'm not that young!). it might be by regulation, it might be by nature of dragnet surveillance + de-anonymization, or a combination of both. but i think it will be a chilling time.

6 comments

That's a great background paper on the Netflix attack; we make a pretty direct comparison in section 5. We also try to use similar methods for comparison in sections 4 and 6. In section 5 we transform people's Reddit comments into movie reviews with an LLM and then see whether LLMs are better than Narayanan's method purely on movie reviews. LLMs are still much better (getting about 8%, but the average person only had 2.5 movies and 48% only shared one movie, so it's very difficult to match).
>we make a pretty direct comparison in section 5

awesome, i saw the mention in the introduction but i haven't yet had a chance for a thorough read through of the paper -- i've just skimmed it. looking forward to reading it in-depth!

We don't need everyone to be completely anonymous to state and corporate actors. We just need to make it so that they can't identify and surveil everyone at once, because it would be too expensive.

The US defense budget is about $1T. They can't spend it all on surveillance, but let's say tech companies and government together spend about this amount per year on surveillance. If we can raise the cost to surveil the average person to over $10K/yr, they just lose. This is very doable.
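
The arithmetic behind this can be checked in a few lines of Python. The $1T/yr budget and the $10K/yr per-person cost come from the comment. The population figure (roughly 340 million US residents) is my assumption, and the function names are just illustrative.

```python
BUDGET_PER_YEAR = 1_000_000_000_000  # $1T/yr, from the comment
COST_PER_PERSON = 10_000             # $10K/yr, from the comment
US_POPULATION = 340_000_000          # assumption: rough US population

def people_coverable(budget, cost_per_person):
    """How many people a yearly budget can surveil at a given per-person cost."""
    return budget // cost_per_person

def break_even_cost(budget, population):
    """Per-person cost above which the budget cannot cover the whole population."""
    return budget / population

covered = people_coverable(BUDGET_PER_YEAR, COST_PER_PERSON)
print(f"{covered:,} people, {covered / US_POPULATION:.0%} of the population")
print(f"break-even: ${break_even_cost(BUDGET_PER_YEAR, US_POPULATION):,.0f} per person per year")
```

Under these assumptions the budget covers 100 million people at $10K each, under a third of the population, and the break-even cost is a little under $3K per person per year.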

Every little precaution you take will raise the cost, probably more than you think. Every open-source project that aims to anonymize and decentralize is an arrow in their knee. They're hoping that you'll get cynical and stop trying because they don't stand a chance otherwise.

Unfortunately the cost for this stuff is going down. Cheaper to collect information, cheaper to store it, cheaper compute, and better algorithms that mean you need fewer resources.

If the cost to surveil the population is $10k per capita today, it'll be $1k in a few years and $100 a few years after that.

This is a war that can't be won, it's just part of the changing landscape of technology in the information era.

I don't think the cost has been going down or will continue to trend downward long term. You're assuming that the public hasn't gained and won't gain additional capabilities while our adversaries evolve. But look at our communication reach, bandwidth, latency, and cipher strength.

How easy was it for the government to deliver mass propaganda before the Internet without the public realizing? How quickly, and how many bits of information, could Alice in Seattle reliably get to Bob in Houston with a strong cipher in the 1960s? Was there ever such a thing as a cipher that's widely used yet unbreakable by the state? Why do you think China blocks TLS 1.3 with ESNI? Do you think it will be harder or easier to pretend to be a different person when there are open-source LLMs that can run on a gaming computer?

The Internet is a recent invention. Smartphones and seamless network coverage are even more recent, and so is curve25519. We're closer than ever to what is effectively secure instant telepathy with anyone in the world. We just need to stay vigilant and not fall for doom and gloom in this last stretch.

> Does privacy of Netflix ratings matter? The issue is not “Does the average Netflix subscriber care about the privacy of his movie viewing history?,” but “Are there any Netflix subscribers whose privacy can be compromised by analyzing the Netflix Prize dataset?”

Well said.

A silver lining of the AI apocalypse is that users may be able to use the technology to maintain their anonymity via LLM paraphrasing.
My guess is that a statistical analysis of other things such as access patterns, timestamps, content you engage with, etc, could de-anonymize you regardless of the phrasing you use, so LLMs won't save you.
True, but you could also use LLMs to autonomously engage with content you're not interested in, batch replies for times you're not around, inject coherent, consistent, plausible, but false details into your messages, or modify/flag details you didn't mean to disclose.
as the_af says, stylometry is only one technique in a bag of techniques used for de-anonymization. a big one to be sure, but nowhere near the only one.
As you say, the_af mentions this an hour before your reply. I'm curious what is the point of your posting a "me too" comment here? Was it to teach naive readers the word stylometry?
Throwaway accounts using "clever" turns of phrase can often be de-anonymized by double-clicking, right-clicking, googling their witty pun, and seeing the sole other instance elsewhere, on Twitter, Facebook, etc.

If I see a couple of words I don't know in a row, I can infer a poster's real name.

I'd be more specific, but any example is doxxing, literally so.

If you have access to the whole site dataset it's much more reliable with simpler checks. You can just use word usage frequency of common words. Someone posted a demo here of doing this to HN comments which was very effective at showing alt accounts for a user.
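
A minimal Python sketch of that kind of check, assuming a tiny hand-picked list of common words and cosine similarity as the comparison. The demo mentioned above may well have worked differently. The word list, the sample texts and the account names are all made up.

```python
import math
import re
from collections import Counter

# Assumption: a small hand-picked set of common "function" words.
COMMON_WORDS = ["the", "a", "of", "and", "to", "in", "that", "it", "is", "but", "i", "you"]

def profile(text):
    """Relative frequency of each common word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in COMMON_WORDS]

def cosine(u, v):
    """Cosine similarity of two frequency vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def most_similar(target_text, corpus):
    """Return the account in `corpus` whose word profile is closest to `target_text`."""
    t = profile(target_text)
    return max(corpus, key=lambda name: cosine(t, profile(corpus[name])))

# Made-up sample data: known accounts and one unattributed comment.
SAMPLES = {
    "alice": "i think that it is the case that you and i agree, but it is hard to say",
    "bob": "the cost of the storage and the cost of the compute in the cloud",
}
UNKNOWN = "the price of the disk and the price of the network in the rack"

print(most_similar(UNKNOWN, SAMPLES))  # bob
```

With a real site dump you would want far more text per account and a much longer word list; a dozen words on one sentence each only shows the mechanics.
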
I assume one's vocabulary is basically a fingerprint, even if one doesn't use unique turns of phrase. Domain knowledge just leaks in and we aren't conscious of it being identifiable.
It's also geographic. There are a bunch of quizzes online that, in 10 or 20 questions, can tell you exactly what area of the US somebody is from. It comes down to the terms you use that you might not even realize are not universal. Highway vs freeway, what you call a sugary carbonated drink, and so on.

OTOH I think a lot of these methods don't matter that much because of plausible deniability. Stylometry and similar processes are always probabilistic, and can be dismissed.

>OTOH I think a lot of these methods don't matter that much because of plausible deniability. Stylometry and similar processes are always probabilistic, and can be dismissed.

while all of it is probabilistic, the issue is that the probability can quickly begin to approach 1 when multiple sources of data & varying techniques are combined.
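
One way to see this, as a rough Python sketch: treat each technique as an independent piece of evidence with a likelihood ratio and update the odds Bayes-style. The independence assumption, the one-in-a-million prior and the 100x ratio per signal are all invented for illustration. Real signals are correlated, so the real climb is slower.

```python
def combine(prior, likelihood_ratios):
    """Posterior probability after multiplying the prior odds by each likelihood ratio."""
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

# Assumptions: a 1-in-a-million prior that a given account is you,
# and each independent signal makes a match 100x more likely.
PRIOR = 1e-6
for n in range(1, 5):
    print(n, "signals ->", round(combine(PRIOR, [100] * n), 4))
```

With these made-up numbers, three signals take the match from one in a million to roughly a coin flip, and a fourth pushes it to about 99%.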

>OTOH I think a lot of these methods don't matter that much because of plausible deniability. Stylometry and similar processes are always probabilistic, and can be dismissed.

i've come to realize that often the "oppressor", or whatever party i imagine using this kind of thing against me, does not care about being exactly right. i will not be able to lawyer my way out. if something is actionable, action will be taken. and i'm not the one deciding if it's actionable

MIT showed this in 2013 after the government was caught illegally spying on Americans with “just metadata”: https://www.nature.com/articles/srep01376