Hacker News
by Sephr 706 days ago
This service claims to not track personal data, yet their docs admit to storing hash(siteID + User-Agent + IP) + seen_paths on their backend for session tracking.[1]

Sites can track sessions without tracking personal data.

1. https://www.goatcounter.com/help/sessions

3 comments

right below that the docs also say that this hash is not persisted, only cached in memory and mapped to a UUIDv4. The UUIDv4 is what persists between sessions.

> The IP address and User-Agent are never stored to the database or disk, and there is no conceivable way to trace the random UUID back to this.
>
> It’s only stored in memory, which is needed anyway for basic networking to work.

I can't say whether that is GDPR compliant, but it's definitely not storing the hash.
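The in-memory mapping the docs describe could be sketched roughly like this. This is not GoatCounter's actual code: the function name `sessionIdFor`, the choice of SHA-256, the newline separator, and the `Map` cache are all assumptions made for illustration.

```javascript
const crypto = require("node:crypto");

// Lives in process memory only (assumption: never written to disk).
// Maps hash(siteID + User-Agent + IP) to a random session UUID.
const sessionsByHash = new Map();

// Hypothetical helper: returns the session UUID for a visitor.
function sessionIdFor(siteId, userAgent, ip) {
  const key = crypto
    .createHash("sha256") // assumed hash function
    .update(`${siteId}\n${userAgent}\n${ip}`)
    .digest("hex");

  let id = sessionsByHash.get(key);
  if (!id) {
    // The UUID is random, so it cannot be derived from the IP or User-Agent.
    id = crypto.randomUUID();
    sessionsByHash.set(key, id);
  }
  return id; // only this UUID would be persisted
}
```

The point of the design is that the persisted value carries no information about the inputs; the link between the two exists only while the process keeps the map in memory.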

> Sites can track sessions without tracking personal data.

Could you detail how that would work?

Fetch an empty resource that is privately cacheable, set to max-age=0, and has an ETag containing the current timestamp and a random session id. The browser will consider its cached copy always stale.

When you next fetch that resource, because it is stale, the browser will revalidate it by passing an If-None-Match header containing the ETag. Update the ETag to include the original timestamp and the current timestamp.

So on every page load (or whichever other event you want to measure), you will be told when that session started, the session id and when that visitor was last seen.

To set the maximum session duration, reset the ETag if the last seen timestamp passed to you in If-None-Match is too long ago.
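A minimal sketch of the server-side logic for this, assuming an ETag layout of `"<sessionId>.<startedAtMs>.<lastSeenMs>"` and a 30-minute idle timeout. The names `parseETag`, `handleBeacon`, and `MAX_IDLE_MS` are made up for illustration, and the function only computes headers; wiring it into an HTTP server is left out.

```javascript
const crypto = require("node:crypto");

const MAX_IDLE_MS = 30 * 60 * 1000; // assumed idle timeout

// Parse an If-None-Match value in the assumed layout; null if absent or malformed.
function parseETag(header) {
  const match = /^"([0-9a-f-]{36})\.(\d+)\.(\d+)"$/.exec(header || "");
  if (!match) return null;
  return { id: match[1], startedAt: Number(match[2]), lastSeen: Number(match[3]) };
}

// Decide what to record for one beacon request and which ETag to send back.
// The response should be a 200 carrying the new ETag, since the ETag changes every time.
function handleBeacon(ifNoneMatch, now = Date.now()) {
  const previous = parseETag(ifNoneMatch);
  const expired = !previous || now - previous.lastSeen > MAX_IDLE_MS;

  const session = expired
    ? { id: crypto.randomUUID(), startedAt: now, lastSeen: now, isNew: true }
    : { ...previous, isNew: false }; // lastSeen is the previous visit time

  return {
    session,
    headers: {
      "Cache-Control": "private, max-age=0",
      ETag: `"${session.id}.${session.startedAt}.${now}"`,
    },
  };
}
```

For example, `handleBeacon(req.headers["if-none-match"])` would give you the session to log and the headers to attach to the empty response.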

This can even work without JavaScript by using an img element.

The only data tracked with this is the session start time, last seen time, and a random session id. Since the session id isn’t related to any of your business logic, it cannot be used to identify an individual.

To further isolate this data, locate the tracking resource on a different hostname. Cookies are only sent to the host or domain they were set for, so your site's cookies won't accompany the request (as long as they aren't scoped to a shared parent domain), and your analytics backend can’t record identifying information even if it wanted to. By default this will also prevent you from tracking which page is being visited, since browsers send only the origin as the referrer on cross-origin requests, though you can override that with the no-referrer-when-downgrade referrer policy.

That's just a cookie. And then you're back to the annoying consent banners.

You just reinvented analytics cookies. You’d be surprised, but they don’t store PII either. It’s usually just a randomized session ID and timestamps, like you’re suggesting.

Why do all this when you can set a cookie with a random session ID?

In browsers, it's as simple as:

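    // Report once per tab session; sessionStorage is cleared when the tab closes.
    // reportSession() stands in for whatever beacon call you use.
    if (!sessionStorage.getItem("sessionReported")) {
      reportSession();
      sessionStorage.setItem("sessionReported", "1");
    }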

> In comparison, in the context of the European GDPR, the Article 29 Working Party[6] considered hashing to be a technique for pseudonymization that “reduces the linkability of a dataset with the original identity of a data subject” and thus “is a useful security measure,” but is “not a method of anonymisation.”[7] In other words, from the perspective of the Article 29 Working Party, while hashing might be a useful security technique, it is not sufficient to convert personal data into deidentified data.

https://www.gtlaw-dataprivacydish.com/2021/03/what-is-hashin...

I am a DPO. The claims Plausible makes won't hold up to scrutiny.

It's a simple trick: declaring all collected data to be technical data, when in fact it is linkable to a data subject.

Thus collection of the data requires consent, because a subject is identified at least for the session.

If you can identify unique visitors you are clearly identifying individuals.

Indeed you are correct. Plausible it is not. They should put their cookie consent back up, and they need to inform their users how they actually process the data collected from them.

  hash(daily_salt + website_domain + ip_address + user_agent)
That's what they do. Within 24 hours the daily salt is gone, and the data is anonymous.

https://plausible.io/data-policy#how-we-count-unique-users-w...
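For readers who want to see the formula in action, here is a minimal sketch. The formula itself is from the data policy quoted above; everything else (SHA-256, plain string concatenation, the names `visitorId` and `rotateSalt`, keeping the salt in a module variable) is an assumption for illustration, not Plausible's implementation.

```javascript
const crypto = require("node:crypto");

// Assumption: the salt is random and replaced once a day, with the old one discarded.
let dailySalt = crypto.randomBytes(16).toString("hex");

// Hypothetical daily job: once this runs, yesterday's IDs can no longer be recomputed.
function rotateSalt() {
  dailySalt = crypto.randomBytes(16).toString("hex");
}

// hash(daily_salt + website_domain + ip_address + user_agent)
function visitorId(websiteDomain, ipAddress, userAgent, salt = dailySalt) {
  return crypto
    .createHash("sha256") // assumed hash function
    .update(salt + websiteDomain + ipAddress + userAgent)
    .digest("hex");
}
```

Within one salt period the same visitor on the same site always gets the same ID, which is what makes unique-visitor counting possible; after rotation the same inputs produce an unrelated value.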

The problem is that this is what they say they do; there are too many examples of companies not complying with their own policies and regulations. They should explain the above-mentioned algorithm in their published privacy declaration. Also, even a hash can be considered personal data unless it has been sufficiently protected, so you need to inform your users anyway.

Good approach. IP addresses are personal data. So the data and the hash are subject to GDPR.

You still need consent to collect it - well or some other kind of legal shenanigans. The intent is to track a person, it is not technically necessary. You might have a legitimate interest - but in the end you still have to consider the GDPR to use this tool.

https://europa.eu/youreurope/business/dealing-with-customers...

Turns out that many officials believe this is fine. Companies using Plausible, Matomo and similar services have been under scrutiny.

An IP address is required for a site to function; your server can't avoid collecting it. Plausible also only processes it for uniqueness and doesn't save it as is. Interestingly, most web servers and firewalls have to keep track of IP addresses, so they end up saved in access logs and caches, which makes them more problematic than Plausible. Yet it's most likely fine, because the intent is not to track individual users but to improve the service and keep it running. Plausible's intent is also not to track individual users but to collect visitor counts, which is something used for improving the service too.

I think you might be prematurely spreading fear.

> Turns out that many officials believe this is fine.

Who has gone on record with this, and in which jurisdictions?

> Plausible also only processes it for uniqueness and doesn't save it as is

That's exactly the point. Processing of personal data to identify a unique person.

Regarding firewalls and logs: it's argued that this is legitimate interest, as stated in Recital 49 of the GDPR. So they got a free pass, for better or worse.

> I think you might be prematurely spreading fear

Don't get me wrong, I like the approach. But it's not a get out of GDPR free card.

That's a bit simplistic. IP addresses are not unequivocally personal data. Let's rewind back a bit, GDPR Art. 4:

> ‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;

IP addresses only allow a natural person to be identified when combined with other data, such as ISP data or a profile built over dozens of websites. This is not the same kind of personal data as a name + address, Breyer notwithstanding (note the bit about the ISP in the judgment).

GDPR is not about identifying an abstract entity, it's about identifying a natural person. Doing the former for long enough/with enough data allows the latter, but especially with time-limited in-memory hashes that's a non-existent window of opportunity.

In practice this'd probably need to be resolved in court, and I'm sure not a single SME using Plausible or similar will even get a stern letter, much less fined.

> In practice this'd probably need to be resolved in court, and I'm sure not a single SME using Plausible or similar will even get a stern letter, much less fined.

Agreed.

Plausible just makes false claims like:

> All the site measurement is carried out absolutely anonymously. Cookies are not used and no personal data is collected. There are no persistent identifiers.

That's a heavy statement and it is simply not true, as you quoted:

> an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person

hash(daily_salt + website_domain + ip_address + user_agent) will fall under this definition.

But again, you are right, it's better than anything any other service does.

What’s your thought on the approach adjust.com takes? They say you can claim legitimate interest.

What are your thoughts on aggregated data? You can still identify unique visitors, but it's aggregated data, so you can't link it back to the individual.

I have doubts that just identifying unique visitors would also identify individuals. Their current approach of creating a random ID that is unique for 24 hours should not violate GDPR, or would it?

You begin at a point where you have data to aggregate. This data is linked to individuals.

Anonymisation of data is itself data processing, and some argue that it is subject to a privacy impact assessment, on the grounds that doing it poorly can have serious negative consequences for individuals if they can be de-anonymised.

The duration itself does not change the outcome.

That said, the approach Plausible takes is much better than any cookie.

I think you can argue if this holds up: you cannot retrieve the ip from the hash (and residential IPs are usually dynamic). The short lifetime together with never storing the hash makes it so you cannot de-anonymise the user.

No one will get fined for not asking consent for this. Our DPO just said ‘don’t be silly’ when I asked him. But we will see if it gets tested (my bet: it won’t).

> I think you can argue if this holds up:

Sadly, reckons don't hold up in court.

> you cannot retrieve the ip from the hash

You don't need to retrieve the ip to make it PII, the hash itself is PII.

You might not think of it as containing actual "personal information", but its sole purpose is to attempt to uniquely identify a person. That makes it PII.

> (and residential IPs are usually dynamic)

This actually makes the short lifetime more suitable as PII, because it reduces the likelihood of the same IP being used by a different person being tracked as the same person.

> The short lifetime together with never storing the hash makes it so you cannot de-anonymise the user.

That also doesn't matter, because the lifetime of the token is long enough to track the user through an entire typical session, maybe several.

The stupid thing in all these shenanigans is that collecting the data isn't itself the problem, it's not getting the user's consent. Just tell the user what you're doing, and it's not a problem - if it's a "technically required" cookie they can make an informed choice to use your site or not, if it's an "optionally required" cookie, they can choose whether to accept or not. Most users won't care and will click on the biggest, most obvious buttons. The ones that do care are likely atypical and would skew your metrics anyway.

> you cannot retrieve the ip from the hash

You can as long as you have IPv4 visitors, because the search space is small enough to brute-force. There are only four billion IP addresses. The user-agent complicates things a little but there aren’t many of those, so you could retrieve the IP addresses of most visitors from the hash if you wanted to.
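To make the brute-force argument concrete, here is a small sketch. It assumes the attacker (for instance, the operator) knows the salt, the domain, and the hash construction, and it assumes SHA-256 over a plain concatenation; `visitorHash` and `recoverIp` are names made up for this example. It only searches a /16 (65,536 addresses) so it finishes quickly; the full IPv4 space is 2^32 addresses, the same loop with two more octets.

```javascript
const crypto = require("node:crypto");

// Assumed construction: hash(salt + domain + ip + userAgent) with SHA-256.
function visitorHash(salt, domain, ip, userAgent) {
  return crypto
    .createHash("sha256")
    .update(salt + domain + ip + userAgent)
    .digest("hex");
}

// Try every address under a two-octet prefix (e.g. "203.0") against a target hash.
// Returns the matching IP, or null if none of the 65,536 candidates match.
function recoverIp(target, salt, domain, userAgent, prefix) {
  for (let c = 0; c < 256; c++) {
    for (let d = 0; d < 256; d++) {
      const ip = `${prefix}.${c}.${d}`;
      if (visitorHash(salt, domain, ip, userAgent) === target) return ip;
    }
  }
  return null;
}
```

An unknown User-Agent multiplies the work by the number of candidate strings, which is why it only complicates things a little. Once the salt is truly discarded, this attack no longer works, since the salt is part of the hashed input.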

> residential IPs are usually dynamic

Usually isn’t good enough. I’ve had residential IPs that are on public record belonging to me personally. IP addresses can be personally identifying information, so they need to be treated that way.

You would still have to produce the paperwork for this.

Most websites don't get fined using GA. Plausible is a huge step in the right direction, but their claims are very strong and not backed up by the GDPR if you take a closer look.

Regarding fines: most offices will give you a warning instead of a fine, you adjust your cookie banner and you are good to go