by Johnbot 58 days ago
A lot of geolocation data on the market is anonymized, following medium-lived unique IDs that can't be mapped to other identifiers. The problem with that is that if you have precise locations, or enough samples that you can apply statistics to find precise locations, in many cases you can de-anonymize the IDs. You can purchase address and resident listings from a number of different data vendors, and by checking where the device returns to at night you can figure out its home address. Then if you find information on the residents (work locations, schools, etc.), you see if said device goes where each resident of the home address is likely to go, and you now have a pretty good idea of exactly who the device belongs to.
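The "where does it return to at night" step can be sketched roughly as below. This is a minimal illustration, not any vendor's actual pipeline: the ping tuple layout, the 22:00 to 06:00 night window, and rounding coordinates to three decimals (roughly a 100 m cell) are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime


def infer_home_cell(pings, night_start=22, night_end=6, precision=3):
    """Guess each device's home as its most frequent night-time grid cell.

    pings: iterable of (device_id, iso_local_timestamp, lat, lon) tuples.
    This tuple layout is hypothetical; real feeds differ.
    """
    night_cells = {}
    for device_id, ts, lat, lon in pings:
        hour = datetime.fromisoformat(ts).hour
        # Keep only pings inside the assumed night window.
        if hour >= night_start or hour < night_end:
            cell = (round(lat, precision), round(lon, precision))
            night_cells.setdefault(device_id, Counter())[cell] += 1
    # The most common night-time cell is the home guess for that device.
    return {d: c.most_common(1)[0][0] for d, c in night_cells.items()}
```

Joining the resulting cell against a purchased address list would be the next step; devices with no night-time pings simply get no guess.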
10 comments

There is no such thing as anonymized location data when you know where and when someone sleeps and works.

It's a rhetorical fiction the ad industry tells itself.

Right, there's probably no other phone in the world that typically stops for hours within 1000 feet of my bed and typically stops on Monday-Friday within 1000 feet of my work-desk.
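One way to check that intuition against a dataset is to count how many devices share the same (home, work) pair. A rough sketch, assuming each device has already been reduced to a coarse home cell and work cell (the dict layout here is hypothetical):

```python
from collections import Counter


def pair_uniqueness(device_places):
    """Fraction of devices whose (home_cell, work_cell) pair is unique.

    device_places: dict mapping device_id -> (home_cell, work_cell).
    The cells can be any hashable value, e.g. rounded coordinates.
    """
    if not device_places:
        return 0.0
    counts = Counter(device_places.values())
    unique = sum(1 for pair in device_places.values() if counts[pair] == 1)
    return unique / len(device_places)
```

A value close to 1.0 would mean the pair alone singles out almost every device in the set.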
Now think what Lavrenti Beria and an LLM could have done with that.
Somebody once said that if Stalin had access to television, he would never have to kill 20+ million ppl. What would he do with all that data? No idea.
If all you've got is full political power and control over propaganda networks, you won't get the USSR. You'll get Hungary between 2010 and 2026. It works well, but in the critical moments when things start going wrong you need to kill people to maintain power, or else your nascent autocracy collapses as quickly as Orban's.
I'm no fun of Stalin, but this meme about 20+ million victims needs to be purged.

"The scholarly consensus affirms that archival materials declassified in 1991 contain irrefutable data far superior to sources used prior to 1991, such as statements from emigres and other informants.

Before the dissolution of the Soviet Union and the archival revelations, some historians estimated that the numbers killed by Stalin's regime were 20 million or higher. After the Soviet Union dissolved, evidence from the Soviet archives was declassified and researchers were allowed to study it. This contained official records of 799,455 executions (1921–1953), around 1.5 to 1.7 million deaths in the Gulag, some 390,000 deaths during the dekulakization forced resettlement, and up to 400,000 deaths of persons deported during the 1940s, with a total of about 3.3 million officially recorded victims in these categories. According to historian Stephen Wheatcroft, approximately 1 million of these deaths were "purposive" while the rest happened through neglect and irresponsibility. The deaths of at least 5.5 to 6.5 million persons in the Soviet famine of 1932–1933 are sometimes included with the victims of the Stalin era." [0]

[0] https://en.wikipedia.org/wiki/Excess_mortality_under_Joseph_...

lol naturally the criminals were obsessed with honestly keeping comprehensive official records of their misdeeds
> I'm no fun of Stalin

I would argue for the generality of this characterization

Only thing better to rule with is a network connected telescreen that monitors and issues orders to the proles.
So Instagram and TikTok?
Pretty sure it would be hard to enslave these people through television
Would it be? I'd argue the current US administration is entirely propped up by television. Hell, the president seems to "rule" based on what Fox News said last night.
I’m pretty sure most phones have a higher location accuracy than 1000 feet.
And with LLMs now it’s easier than ever to piece the parts together. Companies were doing it before we even knew what LLMs were capable of.

Edit: It's a rhetorical fiction the ad industry tells us.

I think this begs the question of what anonymous data means. Sure my visit to HN is "anonymous" in that it doesn't say "abustamam visited this site" but piece together all the other visits that have my "anonymous ID" then eventually it paints a pretty nice picture of who I am.
Does it map to a single, identifiable person or something close enough that the distinction is meaningless?

Then it's not anonymous.

Simple as that.

My point is that even completely anonymous data that conforms to what you just said can easily become de-anonymized when contextualized to other "anonymous" data.
A marketer's definition of anonymized is worthless. It's a fantasy they want everyone else to believe in.

If it can be "de-anonymized" then it was never anonymous to begin with.

"De-anonymized" is quite literally an oxymoron.

> A marketer's definition of anonymized is worthless. It's a fantasy they want everyone else to believe in.

I'm using your definition.

> Does it map to a single, identifiable person or something close enough that the distinction is meaningless?

Also

> If it can be "de-anonymized" then it was never anonymous to begin with.

Well sure, that's the point I was trying to make in my rhetorical question above. Individual pieces of data may be "anonymous" but put together with other anonymous data that can be traced to a single source and suddenly you can figure out quite easily who this person is. The data itself is still technically anonymous but it can be pieced together.

Does that mean that no non-post-quantum encryption was ever actually encryption because in 20 years someone will be able to decrypt things?
We should have learned this lesson 20 years ago when researchers were able to deanonymize a lot of the Netflix Prize dataset, which contained nothing except movie ratings and their associated dates.

https://arxiv.org/abs/cs/0610105

If movie ratings are vulnerable to pattern-matching from noisy external sources, then it should be obvious that location data is enormously more vulnerable.

> In contrast to previous attacks on micro-data privacy [22], our de-anonymization algorithm does not assume that the attributes are divided a priori into quasi-identifiers and sensitive attributes. Examples include anonymized transaction records (if the adversary knows a few of the individual's purchases, can he learn all of her purchases?), recommendation and rating services (if the adversary knows a few movies that the individual watched, can he learn all movies she watched?), Web browsing and search histories [12], and so on. In such datasets, it is impossible to tell in advance which attributes might be available to the adversary;
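To make the idea concrete, here is a much-simplified sketch of that style of linkage attack. It is not the paper's exact algorithm: the data layout, the 14-day date tolerance, the `1 / log(1 + support)` rarity weight, and the ratio test against the runner-up are all assumptions chosen to keep the example short. The point it illustrates is that a few approximately known, rare items are enough to single out one record.

```python
import math


def best_match(aux, records, date_tolerance_days=14, min_gap=1.5):
    """Return the record id that best matches the adversary's partial knowledge.

    aux:     dict item -> approximate day number known to the adversary.
    records: dict record_id -> dict item -> day number (the "anonymized" data).
    Returns None when no record stands out clearly from the runner-up.
    """
    # Count how many records contain each item; rare items are more identifying.
    support = {}
    for ratings in records.values():
        for item in ratings:
            support[item] = support.get(item, 0) + 1

    def score(ratings):
        total = 0.0
        for item, day in aux.items():
            if item in ratings and abs(ratings[item] - day) <= date_tolerance_days:
                total += 1.0 / math.log(1 + support[item])
        return total

    scored = sorted(((score(r), rid) for rid, r in records.items()), reverse=True)
    if not scored or scored[0][0] == 0:
        return None
    # Only claim a match if the best score clearly beats the second best.
    if len(scored) > 1 and scored[0][0] < min_gap * scored[1][0]:
        return None
    return scored[0][1]
```

Swapping movies and rating dates for places and visit times gives the location version of the same attack.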

Is location data highly dimensional, though?

exactly. calling it 'anonymized' is pure security theater once you have enough data points to map out someone's daily routine.

waiting for legislation or eulas to fix this is a lost cause since adtech always finds a loophole. the fix has to be architectural. moving toward stateless proxies that strip device identifiers at the edge before they even hit upstream servers. if the payload never touches a persistent db there is literally nothing to de-anonymize. stateless infra is the only sane way forward
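A minimal sketch of what stripping identifiers at the edge could look like. The field names are hypothetical, since real payloads vary by SDK, and coarsening the coordinates is an extra assumption that goes beyond what the comment proposes:

```python
# Hypothetical identifier field names; adjust to whatever the payload really carries.
IDENTIFIER_FIELDS = {"device_id", "advertising_id", "ip", "user_agent", "session_id"}


def strip_identifiers(payload, coord_precision=2):
    """Return a copy of payload without identifier fields and with coarser coordinates.

    Nothing is stored or logged here, so there is no state to de-anonymize later.
    """
    cleaned = {k: v for k, v in payload.items() if k not in IDENTIFIER_FIELDS}
    for key in ("lat", "lon"):
        if key in cleaned:
            # Two decimals is roughly a 1 km cell; this precision is an assumption.
            cleaned[key] = round(cleaned[key], coord_precision)
    return cleaned
```

The proxy would forward only the cleaned dict upstream and discard the original request.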

To be honest, I feel like this is where iOS and Android are failing us. Why is every app allowed to embed a bunch of trackers? Only blocking cross-app tracking on user request as iOS does is not enough (and data of different apps/websites can be correlated externally).
Because we don’t enforce antitrust law in this country and the people that make those decisions profit from the ads.
> To be honest, I feel like this is where iOS and Android are failing us. Why is every app allowed to embed a bunch of trackers? Only blocking cross-app tracking on user request as iOS does is not enough (and data of different apps/websites can be correlated externally).

Even if Google and Apple both want to commit to fighting this, it becomes a game of whack-a-mole, because there are all sorts of different ways to track users that the platforms can't control.

As an easy example: every time you share an Instagram post/video/reel, they generate a unique link that is tracked back to you so they can track your social graph by seeing which users end up viewing that link. (TikTok does the same thing, although they at least make it more obvious by showing that in the UI with "____ shared this video with you").

im not sure about allowed. perhaps required may be closer.

why would someone include tech that makes people think twice about using the app, unless it is required if you want to "sell" in a particular venue.

if you're developing geolocation based apps, location tracking is a core function.

a calendar absolutely does not require location tracking beyond which side of the prime meridian you are on.

> if you're developing geolocation based apps, location tracking is a core function.

But the subsequent sale of that data is not—is the discussion here.

and the reason why that data is available for sale, starts with forced collection of data, if you want to participate in an app store as a developer.

you cant sell what you dont have unless you lie lower than a rug.

fix the data collection problem and a second order effect of no data for sale emerges.

Are you suggesting Android/iOS app developers are forced into data collection somehow? If so, how? I'm genuinely curious.
> why would someone include tech that makes people think twice about using the app, unless it is required if you want to "sell" in a particular venue.

Because the overwhelming majority of people don't think twice about this tech.

I do, and that's why I use a lot of web tools or old-fashioned phone calls, but most people think metadata=unimportant and assume that the purpose of the app is what it does for them rather than to gather their personal information for sale.

How is this legal under the GDPR? There are clear examples in the citizenlab document of a user being tracked inside of the EU from outside.

Is there not also a requirement for clean consent? Ie a weather app can’t track your precise location?

Companies exist that de-anonymize other data brokers' data. That lets the other data brokers claim they have anonymized data while end users get everything.
you could probably run an anonymization company at the same time you run a de-anonymization company
Best of both worlds - legal and profitable /s
> enough samples that you can apply statistics to find precise locations, in many cases you can de-anonymize the IDs

I think a lot of people don't realize the power of a big enough sample size. With enough samples even something pretty innocent looking like your daily step counter could make you identifiable.

As far as I know we don't have large enough databases to make this happen in practice, but I don't think this is impossible in the future.

How large are you estimating is "large enough"?
Location and identity are inextricably linked. You can't destroy identity without also destroying location and location is critical for myriad purposes.

The analytic reconstruction of identity from location is far more sophisticated than the scenarios people imagine. You don't need to know where they live to figure out who they are. Every human leaves a fingerprint in space-time.

> and location is critical for myriad purposes.

It's not though.

Critical for myriad elective purposes? Sure.

Only if you consider the entire concept of logistics in civilization as "elective".
Seems hyperbolic. We had logistics that functioned extremely well before we had customer location data for sale on 3rd party sites.
If you re-read the comment they didn't say that selling it was intrinsic.
The article is about privacy tracking spyware cookies. I think making statements in that context about how modern logistics don't work without location data implies you mean location data from those sources. I mean, I suppose it doesn't have to, but then it just feels off topic, no?
I don't follow what you mean by 'logistics in civilization' as that's pretty vague and amorphous.

Could you be more specific with maybe a single example of where my physical geographic location is electronically critical for a purpose that isn't elective/optional/avoidable?

(And I'm not just trying to be obtuse. I think you're touching on at least part of the 'heart' of both this conversation and that of digital ID verification.)

How does tracking the movements of individual humans aid shipping and logistics, other than providing traffic data to freight companies? How did we manage to have global supply chains prior to GPS being invented?

Edit: I assume I am missing a crucial part of logistics that you’re familiar with, genuinely curious.

In what sense can the latitude and longitude of my house be called anonymous data?
Ultimately, a map is anonymous data containing lat/lon of everyone's house

Alone, these points are not deanonymizing, it's when there's other data associated.

When such data is sold I'm pretty sure it would be more than just list of coordinates.
From what I've seen none of this is that complex, one could simply 'draw a circle around your house' and get all the "anonymized" device pings and just trace those.
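Roughly, the "draw a circle" query is just a distance filter followed by a lookup of everything else those device IDs did. A sketch under assumed inputs: the `(device_id, lat, lon)` ping layout and the 50 m default radius are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def devices_near(pings, lat, lon, radius_m=50):
    """IDs of all devices with at least one ping inside the circle."""
    return {d for d, plat, plon in pings if haversine_m(lat, lon, plat, plon) <= radius_m}


def trace(pings, device_ids):
    """Every ping, anywhere, belonging to the selected devices."""
    return [p for p in pings if p[0] in device_ids]
```

Calling `trace(pings, devices_near(pings, house_lat, house_lon))` returns the full movement history of whatever "anonymous" IDs showed up at the house.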
Yep. With side channels, or thinking one level above the laws, it's trivial to get around said laws. Need better laws.
> A lot of geolocation data on the market is anonymized

A lot isn't good enough.