by fhub 971 days ago
Truly anonymous data is not subject to the GDPR. So the question is whether the data they are collecting is truly anonymous. They seem to be claiming or suggesting "Yes it is" https://code.visualstudio.com/docs/getstarted/telemetry#_gdp....
4 comments

It is neigh impossible to send truly anonymous data as telemetry. As soon as you're using the internet, you're disclosing an IP address, which is PII. If you add anything to link two subsequent telemetry reports together, that thing is PII (e.g. a hash or a uuid). If the telemetry report is detailed enough that they become somewhat unique, it's PII.

That said, consent is not the only grounds on which you can process PII. Contract, legal obligation, vital interests, public task, or legitimate interests are also valid grounds. Of these, legitimate interests is the most applicable in this situation.

> As soon as you're using the internet, you're disclosing an IP address, which is PII

Yes it's PII which of course is why no one who does Telemetry in a GDPR compliant way would store the IP address. The fact that it's "sent" (in order to send anything at all over http) isn't relevant. Only what's stored, for what reason, and for how long.

> If you add anything to link two subsequent telemetry reports together, that thing is PII (e.g. a hash or a uuid)

Again, no. PII is only information about physical people. Unless the data becomes enough to identify a person (in itself or together with other data), the data is not PII. Having a browser history associated to a random guid might be PII (because the browser history might pinpoint the user, not the guid!). But having a random guid associated to say "has run VS code 12 times this year" is not.

>legitimate interests

No, telemetry is not something MS needs to fulfil the primary purpose of VS Code. Best example is that the OSS version is there, without any telemetry enabled by default, still doing by and large the same job.

Legitimate interest doesn't mean absolutely essential.

The OSS version obviously benefits from the telemetry (to the extent telemetry is useful) because it's downstream of the version developed based on the telemetry.

"Disclosing an IP address" may just be a consequence of the medium of communication happening to be TCP/IP. If MS does not log or store the IP in a meaningful/reversible way, are they processing PII?
In the Google Fonts CDN case the court ruled that it's irrelevant whether the website or Google had the opportunity to link the IP address to the user. The mere possibility of this is enough to consider it protected PII.
Question is whether Google Fonts CDN/server was storing the IP address or not. Linking to a user is secondary. If a server does not log or store raw IPs in the first place, where's the fault?
My man, you are arguing with an established case verdict. https://rewis.io/urteile/urteil/lhm-20-01-2022-3-o-1749320/ The wording that it is irrelevant what Google does with the IP (just the theoretical possibility of misuse is enough) is in the case verdict.
Per Wikipedia, Germany's legal system doesn't have the concept of binding precedent. (And even if it did, in no country is the decision of a trial court binding precedent.)
With that argument - would it hypothetically be legal for anonymised telemetry to be submitted over Tor?
No, the IP should not be exposed to any third party, not only to the final destination. Tor would hide the IP from the final destination but still expose it to the first relaying party.
The first relaying party would see the IP address, but none of the telemetry data. I think it's only the combination of the two that is legally a problem.
> It is neigh impossible

Haha sorry I couldn't continue past that! Neeeiiigggh!

Among the telemetry data:

> MacAddressHash - Used to identify a user of VS Code. This is hashed once on the client side and then hashed again on the pipeline side to make it impossible to identify a given user. On VS Code for the Web, a UUID is generated for this case.

A hash of a hash is about as expensive as a hash and it still uniquely identifies a machine, tying telemetry events to a specific user's machine. Microsoft's own telemetry description generator calls the field "EndUserPseudonymizedInformation". Pseudonymisation is inherently not anonymisation.

This bullshit is why I keep my PiHole on for my dev environment.

Unless there is any PII associated with the pseudonym, there is nothing specifically in GDPR that says you can’t or shouldn’t do this so long as it’s not information that can identify a physical person. Note that being able to attribute multiple pieces of data to the same anonymous person does not necessarily identify them (and it’s important to not accidentally do so).

It’s important though, if you e.g. have multiple products, to use a _different_ pseudonymization (hash salt or whatever) for each one. Otherwise you run the risk of storing data that links too much to a single user, thereby de-pseudonymizing them in the worst case even though no individual app does. Having a user’s behavior across multiple applications could pose such a risk in extreme cases.

Edit: I think it's important to separate "hashing" and "hashing". A properly hashed identifier uses a salt that is generated on the client, so that it can't be used to identify the user. Basically: the first time the app runs, you generate a random salt which is only stored on the client and NEVER sent in telemetry. Anything you would like to transmit over the wire that would risk identifying the user (e.g. a computer name or MAC address) you hash with this local salt. This way no one can go to the database on the server side and try to match any data, e.g. check whether the hash abc123 matches the computer name jimbob because hash("jimbob") = abc123. Just sending hash(MacAddress) without a local random salt would NOT be properly pseudonymous, because an attacker on the server side could ask and answer the question "Does this come from the address macaddress?".
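
To make the difference concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the function names and the example MAC address are made up, and I use HMAC-SHA256 keyed with the local salt rather than a literal hash(salt + value). Nothing here is taken from VS Code's actual implementation.

```python
import hashlib
import hmac
import secrets


def unsalted_id(mac: str) -> str:
    """Plain hash: anyone who can guess the MAC can recompute this value."""
    return hashlib.sha256(mac.encode()).hexdigest()


def new_local_salt() -> bytes:
    """Random salt created on first run; kept on the client, never transmitted."""
    return secrets.token_bytes(32)


def salted_id(mac: str, local_salt: bytes) -> str:
    """Keyed hash (HMAC-SHA256) of the MAC under the client-only salt."""
    return hmac.new(local_salt, mac.encode(), hashlib.sha256).hexdigest()


mac = "00:11:22:33:44:55"  # hypothetical address, for illustration only
salt = new_local_salt()

# Unsalted: a server-side guess can be confirmed by recomputing the hash.
guess_confirmed = unsalted_id("00:11:22:33:44:55") == unsalted_id(mac)

# Salted: the identifier is stable across reports from the same client ...
stable = salted_id(mac, salt) == salted_id(mac, salt)

# ... but someone without the salt cannot reproduce it from the MAC alone.
other_salt = new_local_salt()
reproducible_without_salt = salted_id(mac, other_salt) == salted_id(mac, salt)
```

The salted value still lets reports be correlated with each other, which is the point of a pseudonym; what it removes is the ability to test a guessed MAC against the stored value.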

The hash used, at least when I looked into it last, was a plain sha256 hash, no salt or pepper. That's a unique identifier.

I think the massive amounts of behaviour analysis Microsoft does should be considered PII. They know when you turn on Visual Studio in the morning, and when you leave. They know when you go to lunch and don't click any buttons for a while, and they can see the colleagues with you in that boring meeting also not clicking any buttons at the same time. This type of behaviour analysis over time can associate you and the people you interact with, even if it's not directly tied to a reversible hardware ID.

This is why pseudonymisation isn't anonymisation, and why pseudonymisation isn't sufficient to comply with laws like the GDPR.

If the behaviour analysis was done without identifiers at all, you could say they're just counting button clicks, but they intentionally associate this data with your stable personal identifier for analysis over time.

MAC addresses aren't that big of a search space either: any consumer GPU can generate a list of all hardware MAC addresses in use in a reasonable amount of time. MAC addresses may theoretically be 2^48 in size, but most of the space hasn't been assigned to vendors yet. It takes about 12 minutes to reverse any given MAC address when you rent a single cloud GPU. The double hashing should take about twice that time.
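
A toy version of that attack, as a hedged Python sketch. The assumptions are mine: the MAC is hashed as a lowercase colon-separated string, the second hash is taken over the hex digest of the first, and the attacker already knows the first four bytes so that the demo only has to try 65,536 candidates. A real attack would enumerate the full 24-bit suffix for every assigned vendor prefix, which is the GPU job described above.

```python
import hashlib
from typing import Optional


def double_sha256(mac: str) -> str:
    """Hash once "on the client" and once more "on the pipeline" (assumed format)."""
    first = hashlib.sha256(mac.encode()).hexdigest()
    return hashlib.sha256(first.encode()).hexdigest()


def recover_mac(target_hash: str, known_prefix: str) -> Optional[str]:
    """Try every value of the last two bytes after a known four-byte prefix."""
    for suffix in range(256 * 256):
        candidate = f"{known_prefix}:{suffix >> 8:02x}:{suffix & 0xFF:02x}"
        if double_sha256(candidate) == target_hash:
            return candidate
    return None


secret_mac = "00:11:22:33:be:ef"  # hypothetical address
leaked_identifier = double_sha256(secret_mac)

# With no salt involved, enumeration recovers the original address.
recovered = recover_mac(leaked_identifier, "00:11:22:33")
```

The second hash only doubles the work per candidate; it does not enlarge the space of inputs the attacker has to try.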

The weird thing is that Microsoft intentionally chose to use a MAC address rather than a UUID like they use on their web version. If this was just a unique user token, they wouldn't need to use any hardware identifiers at all.

You are right in the edit. The hash needs to be using a secret salt that is unavailable to any potential attacker to not be PII.

You're mixing up the terms pseudonymization and anonymization, though. If something is provably not PII, it is considered anonymous. Pseudonymization specifically means to keep the data as PII, but where the risk of misuse is reduced by making the identification hard.

In practical terms, pseudonymous data is data that someone like a data scientist will only be able to link to a person if making a deliberate effort to do so, which will almost certainly mean that she KNOWS she is breaking some law. And it may also mean that the link between the person and the pseudonym is stored in a locked down database where most data scientists (or others that may have interest in doing the linking) do not even have access.

The GDPR does promote the use of pseudonymization as a layer of protection, and if a business does keep some PII data around, properly categorizes their data as such (in compliance with Article 30 of GDPR, with a defined "Legal Ground" for processing activities) AND properly protects the data both through "Security by Design" and "Privacy by Design" (of which pseudonymization is an important element), their legal exposure can be either completely negated or at least radically reduced if the "Legal Ground" is challenged.

Overall, though, fully understanding GDPR is terribly difficult, as it requires significant understanding of both Law (International AND local within each country covered by the GDPR), Computer Science (development AND IT security) AND a good understanding of Data Science.

I rarely meet people with enough understanding of all 3 to assess practices that are in the gray zone.

Lawyers (and most DPO's) tend to have little understanding of the IT or Data Science aspects, but tend to be good at stretching a "Legal Ground" to whatever is needed by the business to continue to be profitable.

Data Scientists tend to know how to de-pseudonymize data, and may even be taught "Privacy by Design" (this usually has to be forced on them, though, as it makes their job harder). Most data scientists struggle with IT security aspects, though, and would in many cases happily download all data to their laptops if they could.

Developers/engineers may understand concepts such as hashing, and even know the difference between hashed and encrypted data. However, as they live in a boolean world of True vs False, using judgement to evaluate the risk impact of some practice for data subjects tends to be alien to them. In a black and white world, this group tends to think that every bad practice is equally bad, instead of going for the "lesser wrong" or "good enough". Especially if the measures needed to be "good enough" make the coding harder or the system slower.

Finally, IT security (the experts, not the drones) MAY have a better understanding of degrees of risk than developers, but tend to know/care less about the actual data than any other group.

And each group tends to hold the other groups to a higher standard than their own. The lawyers tend to assume that all aspects of development and infrastructure are properly hardened. Data Scientists tend to interpret the "Legal Ground" to cover whatever they want to use the data for. Developers tend to think that the infra that runs their systems is fully secured by shell protection, and may even store "secrets" in more or less open git repos (and even if they delete them later, they don't clean up the git history or create new secrets). And networking people often do not even care about anything at the "Application Level" or higher of the networking stack.

So in practice, any large corporation will have a huge number of vulnerabilities. The only way any sensitive asset (from a privacy, intellectual property or operational stability perspective) can be considered properly protected is to have multiple layers of protection, all or most of which must fail for major incidents to happen.

I use pseudonymization in the sense of having persistent identifiers for users/machines/etc that cannot be reversed on the server side.

Basically: just like the usernames on hn are pseudonyms it’s important they are persistent so you can follow who wrote what despite not being able to attribute posts to physical persons. That is: hn is a pseudonymous forum rather than anonymous.

The hash(localSalt + PII) is provably not PII. But it’s still making the data possible to correlate. The telemetry event I send on Monday can be attributed to the same source as the event I send on Tuesday.

what's the definition of truly anonymous? they don't know your name? or there isn't enough data to identify you? I've heard that in the US, birthday and postal zip code is enough to identify you in most of the country, but that could be considered anonymous.

if data of multiple users is aggregated, that is I think more of what people are thinking when they think "anonymous"

There are multiple definitions. The most basic (and common) is k-anonymity [1]. Basically, if for a given collection of data you group by all variables that are already non-anonymous (like age, address, gender, occupation) and end up with groups of fewer than k people (where k=5 is common), then any other data items in the data set linked to the same individual also become non-anonymous (PII).

Even if you have groups of size greater than k, though, information elements may be non-anonymous if there is not enough diversity in the group. For instance, if every 49-year-old male on a given postal code in a given occupation has a certain religion, then religion is non-anonymous for that group, according to l-diversity [2].

This can be narrowed down even more by t-closeness [3].
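
The first two checks can be sketched in a few lines of Python. This is a rough illustration on made-up rows, with column names I invented, and it uses the simplest "distinct values" reading of l-diversity; it is not a complete anonymity audit.

```python
from collections import defaultdict


def group_by_quasi_identifiers(rows, quasi_identifiers):
    """Bucket rows by the combination of already non-anonymous attributes."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row)
    return groups


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group: the data is k-anonymous for any k up to this."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return min(len(group) for group in groups.values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found within any group."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return min(len({row[sensitive] for row in group}) for group in groups.values())


# Hypothetical data: two groups of three people each.
rows = [
    {"age": 49, "postcode": "1234", "religion": "A"},
    {"age": 49, "postcode": "1234", "religion": "A"},
    {"age": 49, "postcode": "1234", "religion": "A"},
    {"age": 30, "postcode": "1234", "religion": "A"},
    {"age": 30, "postcode": "1234", "religion": "B"},
    {"age": 30, "postcode": "1234", "religion": "C"},
]

k = k_anonymity(rows, ["age", "postcode"])
l = l_diversity(rows, ["age", "postcode"], "religion")
```

Here every group has three members, yet the 49-year-olds all share one religion, so knowing someone is in that group reveals the sensitive value even though the group size looks acceptable.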

  [1] https://en.wikipedia.org/wiki/K-anonymity
  [2] https://en.wikipedia.org/wiki/L-diversity
  [3] https://en.wikipedia.org/wiki/T-closeness
There is no such thing as truly anonymous. In order to send any data you need to connect to a server. At that moment you are in violation of GDPR because you are exposing the user's IP, which is protected by GDPR. See the case where even linking to a CDN requires GDPR consent. https://www.cpomagazine.com/data-protection/leak-of-ip-addre...

And before the army of those who don't understand GDPR comes up with "but then the whole internet can not work": the crucial distinction comes in the answer to the question "can this tool fulfill its purpose without this connection?" If no, then the connection is essential to its functioning and does not require consent. If the tool can fulfill its purpose without this connection, it's optional and does require consent.

GDPR makes a distinction between connections that are required to fulfill the purpose of the tool and connections that are not essential. So a VS Code connection to a Microsoft server to, let's say, update or download an extension is allowed and does not require consent, because without that connection VS Code cannot fulfill its purpose of providing functionality.

Telemetry is not functionality, and VS Code can fulfill its purpose without this connection, so that makes it subject to the user consent requirement.

By that logic, Ubuntu performs a connectivity check behind the scenes polling connectivity-check.ubuntu.com every few mins to detect if internet connectivity has been lost.

I do not recollect seeing any opt-in Privacy prompt enabling this feature. Surely an OS can function without the internet so it's not "essential to its functioning".

Same with Firefox's captive portal check [1] that helps determine if a Wifi network requires a web-based sign-in or acceptance of terms of use.

[1] https://en.wikipedia.org/wiki/Captive_portal

Yes, Ubuntu is in violation of GDPR too if it does not connect for essential functionality. One essential functionality that is acceptable for any OS is that of checking for updates, because security is an essential part of an OS.
Wouldn't even checking Microsoft's server be an unnecessary connection? You could argue that VS Code would still work, as updates are basically optional and could be triggered manually, too.
Yes, I meant connecting to update/install in response to a user action that wants to install extension for "X functionality".
> There is no such thing as truly anonymous. in order to send any data you need to connect to a server. at that moment you are in violatation of GDPR because you are exposing the users's IP which is protected by GDPR.

This is misinformed. There is nothing in the GDPR that relates to "exposing" or "transmitting" anything (other than transmitting further from a processor to a third party). GDPR relates to how data is stored or processed. A program can make any number of http requests, for any reason no matter how unnecessary, so long as that PII (the IP, or similar) isn't stored or otherwise processed/transmitted to a third party in a way that the GDPR concerns. The download web server's logs are such a storage (which is why these days you clear those every day, or never log the IP in them at all).

> Telemetry is not functionaliy and VSCode can execute it's purpose without this connection so that makes it subject to user consent requirement.

No. It's required because the telemetry data is stored whereas the IP of the update request is not. Had Microsoft wanted to store every IP of everyone downloading an update, then that database of IPs/downloads would of course have been subject to the GDPR too. The data isn't less sensitive just because it was from a necessary function. Microsoft's responsibility for that data is exactly the same.

But the easiest way of doing telemetry properly and not worrying about GDPR is to not store anything that is PII at all. And it's pretty easy to do so too. Nothing is "truly anonymous". Telemetry is usually pseudonymous. But properly pseudonymous telemetry is normally not a privacy concern in any way. The true gripes about telemetry (there are a few valid ones) aren't about that; they are:

- People getting a worse experience e.g. a slower product

- People not trusting the companies to adhere to the GDPR with the data transmitted, e.g. you might not trust the server to clear IP's from the transmission (basically the only piece of PII that can't be cleared on the client side because then the package never arrives). But if you don't trust the company to adhere to the GDPR then why would one trust their opt-out does anything? Running any kind of software basically means trust to some extent.

- People feeling cheated because of automatic or hidden opt-in

- People on paid internet connections spending money to send the telemetry.

I last studied the gdpr years ago but that most definitely appears false, provide your sources.

The GDPR deals with "processing" and this is the definition of processing:

" ‘processing’ means any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction; "

Note the "transmission, dissemination or otherwise making available".

I could be mistaken but I think whether the http request makes anything ‘available by transmission’ is down to the definition of who is the data controller and which data processors exist. So in the case of telemetry where no PII changes hands, and no PII is stored, then I can’t see how it applies. That is, assuming that the Telemetry backend here belongs to the same entity that made the app. Such as if a microsoft product phones home to its own backend.

Apps that make http requests to other endpoints belonging to third parties are much murkier.

As far as consent is concerned: Whether consent is required for making a http request containing an IP in the header based on legitimate interest is also murky. Consent is only one way of permitting the processing. Whether Telemetry is legitimate interest I don’t think is established. But it’s important to remember that it is not only “absolutely essential” functionality that counts as a legitimate interest. That is: something isn’t automatically not legitimate because it could be removed and still deliver the functionality to the user. Online ads are contested (because profit can be a legitimate interest). The same for telemetry. It’s certainly of interest to the developer to get the data. I have not seen any rulings yet on that but Microsoft has made a pretty decent legal analysis when they conclude that they will never need consent here.

A web server owner can even store data for some time since preventing denial of service attacks could mean they need to store IPs for a short while before deleting. As that’s a legitimate interest, this would not require user consent from visitors.
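
As a rough sketch of that "keep briefly, then delete" pattern in Python: the class name and the retention window below are my own assumptions (the 10 minutes is an arbitrary placeholder, not a legal threshold), and a real server would persist and purge this differently.

```python
import time

RETENTION_SECONDS = 600  # assumed window; pick whatever your abuse handling needs


class RecentIPLog:
    """Keeps (timestamp, ip) pairs only for a short retention window."""

    def __init__(self, retention_seconds=RETENTION_SECONDS):
        self.retention = retention_seconds
        self._seen = []

    def record(self, ip, now=None):
        """Remember that this IP made a request at time `now`."""
        now = time.time() if now is None else now
        self._seen.append((now, ip))

    def purge(self, now=None):
        """Drop every entry older than the retention window."""
        now = time.time() if now is None else now
        cutoff = now - self.retention
        self._seen = [(t, ip) for t, ip in self._seen if t >= cutoff]

    def count(self, ip):
        """Number of retained requests from this IP (e.g. for rate limiting)."""
        return sum(1 for _, seen_ip in self._seen if seen_ip == ip)


log = RecentIPLog(retention_seconds=60)
log.record("203.0.113.7", now=0)
log.record("203.0.113.7", now=50)
log.purge(now=100)  # the entry from t=0 is now older than 60 seconds
remaining = log.count("203.0.113.7")
```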

So first of all you said "There is nothing in the GDPR that relates to "exposing" or "transmitting" anything (other than transmitting further from a processor to a third party). GDPR relates to how data is stored or processed." .

That was false, since the definition of processing explicitly includes transmitting.

VS Code requires accepting the all-encompassing Microsoft privacy statement, and I couldn't find quickly what legal reasons they use for telemetry.

"Legitimate reasons" can practically indeed mean almost anything, and the only limits to it are those placed by subsequent guidances or interpretations of the central or local privacy authorities. It's what largely makes the gdpr a joke. It's very likely that Microsoft relies on it, whether that's acceptable or not.

You seem to consider local software to be part of the copyright holder's infrastructure, and that appears ludicrous. Transmission of usage data from a local application to another company's server is most definitely transmission.

Whether VS Code's telemetry is legal or not I don't know, and I'm not interested in delving into it right now. If I had to use it I'd block it, and I probably wouldn't use it if that became impossible.