| HN Mirror


by jsmith45 1614 days ago

How about a site like Wikipedia?

There is relatively little non-public information about users kept: the email address, the date and time of a few first-time actions (like creating the account, verifying the email address, going to an edit page, etc.), and some account settings like language. They do keep some data short term, like the IP address a given user signs in with. I'm not certain how long this data is kept, but apparently up to 90 days. This is one of the tools used to check for certain types of abuse by logged-in users, like sockpuppetry.

The majority of information about a user that the site stores is publicly displayed information clearly voluntarily submitted, with implied consent for use, like what pages the user edited and when (public info), information they choose to add to their user page, etc.

But nevertheless, Wikipedia is potentially a pile of GDPR violations, despite pretty clearly not doing the sort of stuff the GDPR is trying to restrict.

Potential violations include:

#1. When an anonymous user edits the site, the edit is publicly attributed to an IP address, which is kept forever. IP addresses are considered personal data under the GDPR. It is not feasible to only keep these addresses for 30 days, as all edits need to be attributed to something. It is not at all clear that keeping the IP address indefinitely for this falls on the correct side of the legitimate interest line. So this could well be a GDPR violation.

#2. What about users requesting deletion? While the project can delete the user pages, and even rename the account to something non-specific (like renaming away from the user's real name), it is likely not terribly difficult for someone to identify the renamed user, especially if they ever left a signed message on a talk page. Retroactively modifying such past edits, and editing other people's posts that referenced the old username, would be too disruptive. But it is not 100% clear that what Wikipedia reasonably can do is enough under the GDPR.

Also, technically speaking as a rule Wikipedia never actually deletes revisions from the database unless technical reasons require it. Deleted articles are no longer visible but are still stored in the DB. Even copyright violations are normally only rev-deleted (can be restored by admins), or Suppressed (can be restored by oversight users). This sort of not-actual deletion might not actually be enough under the GDPR.

#3a. Let's say a user submits a data access request. Wikipedia could provide them with their own email address, profile settings, and non-public temporarily logged information about the user, like the IP addresses used to log in. They could provide a copy of the user pages, and even the complete history of them, as well as all the edits the user has made, possibly even edits that are not currently public (like articles that have been deleted, edits that were suppressed, etc.). But is that all really enough?

What if other users on some talk page end up talking about this user, without specifying the username (so the Wikimedia Foundation cannot easily find the reference), but the prose is sufficiently specific to clearly identify this natural person? The posts could potentially even reference other interesting data about the person, like their religion. While the Wikimedia Foundation may not have the sort of AI needed to parse the conversation, extract the personal data, and associate it with the user in question, by the strict letter of the GDPR it still counts as personal data, and there is no infeasibility exception to disclosing it. So if the user later finds this conversation, and then wants the relevant data protection agency to go after Wikipedia, the agency technically could justify issuing a fine here. Is it likely to actually happen? Of course not! But it could if for some reason the relevant people at the agency have a personal grudge against Wikipedia.

#3b. Once again a data access request: What if the user is actually also a public figure? Surely they would also need to be given a copy of their article, and possibly the complete history of the article. But there could well be other articles that reference this person, and it is not necessarily feasible to automatically find all of them, especially if any don't explicitly link to the subject's main article. Once again, strictly speaking, not providing any personal data contained in those other articles would be a violation of the letter of the GDPR, despite not violating the spirit.

------------------------------

These are only a handful of edge cases I can come up with. In all of these scenarios Wikipedia is being very reasonable, is not trying to collect any more personal data than needed to run their site, and is being fairly reasonable in trying to balance users' rights with practical considerations. But they still have multiple places where it could be argued they violate the GDPR nevertheless. They are not an evil company trying to collect personal data and mine it for profit or sell it. But the extreme vagueness about details contained in the GDPR makes it hard to really have any idea for sure whether they are on the correct side of it or not.

This is true despite the fact that no data protection agency is likely to ever try to take action against the Wikimedia Foundation for such violations, simply because in practice their actions are good enough, and trying to attack something like Wikipedia will likely piss off the population who want the agency to instead go after Facebook, or companies who have massive data leaks they try to hide.

One might argue that Article 85 might be interpreted to protect Wikipedia under freedom of expression and information. Or perhaps one might say that the data qualifies for processing under Article 6(1)(e) because identifying users modifying a public resource is a necessary part of the task of developing Wikipedia itself, which is a task in the public interest (questionable, but not impossible to try to argue). But let's say it was not actually Wikipedia in question, but some other forum of user-provided content with similar limitations, that might not qualify for extended protections for freedom of expression and information, or as a task in the public interest?

Some of these same sorts of concerns technically apply to any sort of online public discussion forum, even ones that are very much not trying to collect personal information beyond the bare minimum they need for accounts and anti-abuse. Even this very forum we are on right now can potentially suffer from the "other people talking about you in an identifiable way, but admins cannot find the conversation to provide it to you for an access request" problem.

1 comment

M2Ys4U 1614 days ago

I think you're over-complicating some of these.

On point 1, I think legitimate interests covers this fairly well, but it would be arguable for sure.

On point 2, the right to erasure is not absolute, so the fact that data are not purged from the database is not relevant. Legitimate interests also come into play here.

On point 3a, the GDPR only mandates that data subjects are given access to personal data, so the WMF need not collate the information to send to them. Surfacing rev-deleted data might be more tricky, I suppose. Wikipedia has policies against posting personal data of other users, and such edits will be oversighted where it's brought to admin attention (see WP:OUTING).

On point 3b, again the legal requirement is to give access to the data. Rectification and erasure are also straightforward (edit the page, ask for other edits to be revdeled/oversighted if they violate WP:BLP). Like you say, Article 85 offers wide protection here, too.
