| HN Mirror



	by hyperhopper 378 days ago
	This is the real news. It should be illegal to call something deleted when it is not.

4 comments

girvo 378 days ago

> It should be illegal to call something deleted when it is not.

I don't disagree, but that ship sailed at least 15+ years ago. Soft delete is the name of the game basically everywhere...
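
For readers who have not met the pattern, here is a minimal sketch of what "soft delete" usually looks like at the application level, using an in-memory SQLite database. The `chats` table and the `deleted_at` column are made-up names for illustration, not anything known about a specific product.

```python
import sqlite3

# Hypothetical schema: a "chats" table with a nullable deleted_at column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chats (id INTEGER PRIMARY KEY, body TEXT, deleted_at TEXT)"
)
conn.execute("INSERT INTO chats (body) VALUES ('hello'), ('secret question')")


def soft_delete(conn, chat_id):
    """Mark the row as deleted; the body stays in the table."""
    conn.execute(
        "UPDATE chats SET deleted_at = datetime('now') WHERE id = ?", (chat_id,)
    )


def hard_delete(conn, chat_id):
    """Remove the row from the table."""
    conn.execute("DELETE FROM chats WHERE id = ?", (chat_id,))


def visible_chats(conn):
    """What a user-facing view would show: only rows not marked deleted."""
    rows = conn.execute(
        "SELECT body FROM chats WHERE deleted_at IS NULL ORDER BY id"
    )
    return [body for (body,) in rows]


def all_stored_bodies(conn):
    """What anyone with table access can still read."""
    return [body for (body,) in conn.execute("SELECT body FROM chats ORDER BY id")]


soft_delete(conn, 2)
```

After `soft_delete(conn, 2)`, `visible_chats(conn)` no longer lists the second chat, while `all_stored_bodies(conn)` still returns it.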

aranelsurion 378 days ago

Consequently all your "deleted chats" might one day become public if someone manages to dump some tables off OpenAI's databases.

Maybe not today, in its heyday, but who knows what happens in 20 years, once OpenAI becomes the Yahoo of AI, or loses much of its value, gets scrapped for parts and is bought by less sophisticated owners.

It's better to regard that data as already public.

eurekin 378 days ago

At work we dutifully delete all data on a GDPR request.

sahila 378 days ago

How do you manage deleting data from backups? Do you now not take backups?

crdrost 378 days ago

"When data subjects exercise one of their rights, the controller must respond within one month. If the request is too complex and more time is needed to answer, then your organisation may extend the time limit by two further months, provided that the data subject is informed within one month after receiving the request."

With a 60-day backup retention policy, you can respond within a week or two, telling the person that you have purged their data from the main database, that backups containing it still exist and cannot be changed, and that those backups will be deleted automatically within 60 days.

The only real difficulty is if those backups are actually restored, then the user deletion needs to be replayed, which is something that would be easy to forget.
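
One way to make the replay hard to forget is to keep a small erasure ledger outside the backups and apply it as part of every restore. The sketch below assumes that design; `ErasureLedger`, `erase_user` and `restore_from_backup` are hypothetical names, and a plain dict stands in for the database.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ErasureLedger:
    """Append-only record of erasure requests, stored apart from the backups.

    It holds only an opaque user id and a timestamp.
    """
    entries: list = field(default_factory=list)

    def record(self, user_id, when):
        self.entries.append((user_id, when))

    def since(self, cutoff):
        """User ids whose erasure was requested at or after the cutoff."""
        return [uid for uid, when in self.entries if when >= cutoff]


def erase_user(db, user_id):
    """Remove everything stored for this user from the live database."""
    db.pop(user_id, None)


def restore_from_backup(backup, backup_taken_at, ledger):
    """Restore a snapshot, then replay erasures requested after it was taken."""
    db = dict(backup)
    for user_id in ledger.since(backup_taken_at):
        erase_user(db, user_id)
    return db


# Example: bob asks to be erased three days after the snapshot was taken.
t0 = datetime(2024, 1, 1)
live = {"alice": ["a1"], "bob": ["b1"]}
backup = dict(live)
ledger = ErasureLedger()
erase_user(live, "bob")
ledger.record("bob", t0 + timedelta(days=3))
restored = restore_from_backup(backup, t0, ledger)
```

Ledger entries would need to be kept at least as long as the oldest backup that could still be restored.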

Gigachad 378 days ago

Probably most just ignore backups. But there were some good proposals where you encrypt every user's data with their own key. So a full delete is just deleting the user's encryption key, rendering all data everywhere, including backups, inaccessible.
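
A minimal sketch of that idea (often called crypto-shredding) follows. Assumptions: `KeyStore` and the helper names are hypothetical, and the "cipher" is a toy XOR keystream built from SHAKE-256 only so the example runs with the standard library. A real system would use an authenticated cipher such as AES-GCM and would keep the key store out of long-lived backups.

```python
import hashlib
import secrets


def toy_cipher(key, nonce, data):
    """Toy stand-in for a real cipher: XOR with a SHAKE-256 keystream.

    Applying it twice with the same key and nonce returns the original data.
    """
    pad = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, pad))


class KeyStore:
    """Per-user keys, kept separately from the data they protect."""

    def __init__(self):
        self._keys = {}

    def key_for(self, user_id):
        return self._keys.setdefault(user_id, secrets.token_bytes(32))

    def get(self, user_id):
        return self._keys.get(user_id)  # None once the key is shredded

    def shred(self, user_id):
        self._keys.pop(user_id, None)


def encrypt_for_user(keys, user_id, plaintext):
    nonce = secrets.token_bytes(16)
    return nonce + toy_cipher(keys.key_for(user_id), nonce, plaintext)


def decrypt_for_user(keys, user_id, blob):
    key = keys.get(user_id)
    if key is None:
        return None  # key was deleted, so the ciphertext is unreadable
    return toy_cipher(key, blob[:16], blob[16:])


keys = KeyStore()
backup_copy = encrypt_for_user(keys, "alice", b"alice's chat history")
```

Calling `keys.shred("alice")` leaves `backup_copy` byte-for-byte unchanged, but `decrypt_for_user` can no longer recover the plaintext from it.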

jandrewrogers 378 days ago

Deletion via encryption only works if every user’s data is completely separate from every other user’s data in the storage layer. This is rarely the case in databases, indexes, etc. It also is often infeasible if the number of users is very large (key schedule state alone will blow up your CPU cache).

Databases with data from multiple users largely can’t work this way unless you are comfortable with losing several orders of magnitude of performance. It has been built many times, but performance is so poor that it is deemed unusable.

blagie 377 days ago

The entire mess isn't with data in databases, but on laptops for offline analysis, in log files, backups, etc.

It's easy enough to have a SQL query to delete a user's data from the production database for real.

It's all the other places the data goes that's a mess, and a robust system of deletion via encryption could work fine in most of those places, at least in the abstract with the proper tooling.

alisonatwork 377 days ago

Some of these issues could perhaps be addressed by having fixed retention of PII in the online systems, and encryption at rest in the offline systems. If a user wants to access data of theirs which has gone offline, they take the decryption hit. Of course it helps to be critical about how much data should be retained in the first place.

It is true that protecting the user's privacy costs more than not protecting it, but some organizations feel a moral obligation or have a legal duty to do so. And some users value their own privacy enough that they are willing to deal with the decreased convenience.

As an engineer, I find it neat that figuring out how to delete data is often a more complicated problem than figuring out how to create it. I welcome government regulations that encourage more research and development in this area, since from my perspective that aligns actually-interesting technical work with the public good.

catlifeonmars 377 days ago

You can use row based encryption and store the encrypted encryption key alongside each row. You use a master key to decrypt the row encryption key and then decrypt the row each time you need to access it. This is the standard way of implementing it.

You can instead switch to a password-based key derivation function for the row encryption key if you want the row to be encrypted by a user-provided password.
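
A rough sketch of both variants, under stated assumptions: the row layout and function names are hypothetical, the XOR keystream is a toy placeholder for a real cipher (for example AES-GCM), and the PBKDF2 iteration count is only an illustrative value. `hashlib.pbkdf2_hmac` is the standard-library key derivation call.

```python
import hashlib
import secrets


def _xor_stream(key, nonce, data):
    """Toy stand-in for a real cipher: XOR with a SHAKE-256 keystream."""
    pad = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, pad))


def encrypt_row(master_key, plaintext):
    """Encrypt with a fresh row key and store that key wrapped by the master key."""
    row_key = secrets.token_bytes(32)
    nonce = secrets.token_bytes(16)
    return {
        "nonce": nonce,
        # The row key sits next to the row, but only in wrapped form.
        "wrapped_key": _xor_stream(master_key, b"wrap" + nonce, row_key),
        "ciphertext": _xor_stream(row_key, b"data" + nonce, plaintext),
    }


def decrypt_row(master_key, row):
    """Unwrap the row key with the master key, then decrypt the row."""
    row_key = _xor_stream(master_key, b"wrap" + row["nonce"], row["wrapped_key"])
    return _xor_stream(row_key, b"data" + row["nonce"], row["ciphertext"])


def row_key_from_password(password, salt, iterations=200_000):
    """Password variant: derive the row key with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)


def encrypt_row_with_password(password, plaintext):
    salt = secrets.token_bytes(16)
    nonce = secrets.token_bytes(16)
    row_key = row_key_from_password(password, salt)
    # Only the salt is stored; the key itself is never written anywhere.
    return {
        "salt": salt,
        "nonce": nonce,
        "ciphertext": _xor_stream(row_key, b"data" + nonce, plaintext),
    }


def decrypt_row_with_password(password, row):
    row_key = row_key_from_password(password, row["salt"])
    return _xor_stream(row_key, b"data" + row["nonce"], row["ciphertext"])
```

In the first variant, whoever holds the master key can read every row; in the second, the service cannot decrypt a row without the user's password.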

liamYC 378 days ago

Smart, how do you back up the user's encryption keys?

aiiane 378 days ago

A set of encryption keys is a lot smaller than the set of all user data, so it's much more viable to have both more redundant hot storage and more frequently rotated cold storage of just the keys.

Trasmatta 378 days ago

Most companies don't keep all backups in perpetuity, and instead have rolling backups over some period of time.

alisonatwork 378 days ago

Backups can have a fixed retention period.
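
As a small illustration of what a fixed retention period implies, the sketch below picks out expired backups and computes the date by which no backup predating an erasure can still exist. The 60-day window is an assumed policy (the same figure as the example upthread), and the function names are hypothetical.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=60)  # assumed policy, not a legal requirement


def expired(backup_times, now, retention=RETENTION):
    """Return the backup timestamps that are older than the retention window."""
    return [t for t in backup_times if now - t > retention]


def latest_purge_date(erasure_time, retention=RETENTION):
    """Date by which every backup taken before the erasure has aged out."""
    return erasure_time + retention


backups = [datetime(2024, 1, 1), datetime(2024, 2, 1), datetime(2024, 3, 1)]
to_delete = expired(backups, now=datetime(2024, 3, 15))
```

Here only the 1 January backup is past the window on 15 March, so `to_delete` contains just that one timestamp.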

sahila 377 days ago

Sure, but now when the backup is restored two weeks later, is the user redeleted or just forgotten about?

alisonatwork 377 days ago

Depends on the processes in place at the company. Presumably if a backup is restored, some kind of replay has to happen after that, otherwise all the other users are going to lose data that arrived in the interim. A catastrophic failure where both two weeks of user data and all the related events get irretrievably blackholed could still happen, sure, but any company where that is a regular occurrence likely has much bigger problems than complying with GDPR.

The point is that none of these problems are insurmountable - they are all processes and practices that have been in place since long before GDPR and long before I started in this industry 25+ years ago. Even if deletion is only eventually consistent, even if a few pieces of data slip through the cracks, it is not hard to have policies in place that at least provide a best effort at upholding users' privacy and complying with the regulations.

Organizations who choose not to bother, claiming that it's all too difficult, or that because deletion cannot be done 100% perfectly it should not even be attempted at all, are making weak excuses. The cynical take would be that they are just covering for the fact that they really do not respect their users' privacy and simply do not want to give up even the slightest chance of extracting value from that data they illegally and immorally choose to retain.

simonw 378 days ago

Purely out of interest, how do you verify that the GDPR request is coming from the actual user and not an imposter?

dijksterhuis 378 days ago

> The organisation might need you to prove your identity. However, they should only ask you for just enough information to be sure you are the right person. If they do this, then the one-month time period to respond to your request begins from when they receive this additional information.
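
The timing rules quoted in this thread (one month to respond, up to two further months for complex requests, with the clock starting once identity is confirmed) can be sketched as simple date arithmetic. This assumes a plain calendar-month count clamped to the end of shorter months; the function names are hypothetical and the exact counting rules used by a regulator may differ.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Same day N months later, clamped to the end of shorter months."""
    index = d.month - 1 + months
    year, month = d.year + index // 12, index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def response_deadline(received, identity_confirmed=None, extended=False):
    """One month from the start of the clock, or three if extended."""
    start = identity_confirmed or received
    return add_months(start, 3 if extended else 1)
```

For example, a request received on 1 March 2024 whose sender proves their identity on 10 March 2024 gets a deadline of 10 April 2024 under these assumptions.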

https://ico.org.uk/for-the-public/your-right-to-get-your-dat...

eurekin 377 days ago

In my domain, our set of services only authorizes the Customer Centre system to do so. I guess I'd need to ask them for details, but I always assumed they have checks in place.

gruez 378 days ago

That won't work in this case, because I doubt GDPR requests override court orders.

miki123211 378 days ago

This is very, very hard in practice.

With how modern systems, languages, databases and file systems are designed, deletion often means "mark this as deleted" or "erase the location of this data". This is true on all possible levels of the stack, from hardware to high-level application frameworks.

Changing this would slow computers down massively. Just to give a few examples, backups would be prohibited, as would garbage collection and all existing SSDs. File systems would have to wipe data on unlink(), which would increase drive wear and turn operations which everybody assumed were O(1) for years into O(n), and existing software isn't prepared for that. Same with zeroing out memory pages: OSes would have to be redesigned to do it all at once when a process terminates, and we just don't know what the performance impact of that would be.
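
To make the O(1) versus O(n) point concrete, here is a sketch of a userspace "wipe, then unlink" helper. A plain `os.unlink` only removes the directory entry, while this version has to write as many bytes as the file contains first. The helper names are hypothetical, and on SSDs or copy-on-write file systems the overwrite may well land on different physical blocks, so this illustrates the cost rather than guaranteeing erasure.

```python
import os
import tempfile


def wipe_and_unlink(path, chunk_size=1 << 20):
    """Overwrite a file with zeros, then unlink it. Returns bytes written."""
    size = os.path.getsize(path)
    written = 0
    with open(path, "r+b") as f:
        while written < size:
            n = min(chunk_size, size - written)
            f.write(b"\x00" * n)  # work grows with file size: O(n)
            written += n
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the zeros to the device
    os.unlink(path)  # removing the name itself does not touch the data
    return written


def make_temp_file(size):
    """Create a temporary file filled with random bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size))
    return path
```

Calling `wipe_and_unlink(make_temp_file(3_000_000))` writes three million bytes before the file disappears, whereas a bare unlink would write none of them.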

Gigachad 378 days ago

You just do it the way fast storage wipes do it. Encrypt everything, and to delete you delete the decryption key. If a user wants to clear their personal data, you delete their decryption key and all of their data is burned without having to physically modify it.

jandrewrogers 378 days ago

That only works if you have a single key at the block level, like an encryption key per file. It essentially doesn’t work for data that is finely mixed with different keys, such as in a database. Encryption works on byte blocks, 16 bytes in the case of AES. Modern data representations interleave data at the bit level for performance and efficiency reasons. How do you encrypt a block with several users' data in it? Separating these out into individual blocks is extremely expensive in several dimensions.
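
A small sketch of the block-sharing problem, assuming a hypothetical column of 4-byte values stored back to back, one per user. It only works at byte granularity, so it understates the bit-level interleaving described above, but it shows how one 16-byte AES block ends up holding data from several users.

```python
from collections import defaultdict

AES_BLOCK = 16  # AES block size in bytes


def users_per_block(records, block_size=AES_BLOCK):
    """records: list of (user_id, value_bytes) stored back to back.

    Returns {block_index: set of user_ids whose bytes fall in that block}.
    """
    owners = defaultdict(set)
    offset = 0
    for user_id, value in records:
        if value:
            first = offset // block_size
            last = (offset + len(value) - 1) // block_size
            for block in range(first, last + 1):
                owners[block].add(user_id)
        offset += len(value)
    return dict(owners)


# Hypothetical column of 4-byte integers, one per user.
column = [(f"user{i}", i.to_bytes(4, "little")) for i in range(8)]
shared = users_per_block(column)
```

With this layout each block holds values from four different users, so no single per-user key can cover a whole block without padding every value out to its own block.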

There have been several attempts to build, e.g., databases that worked this way. The performance and scalability were so poor compared to normal databases that they were essentially unusable.

girvo 378 days ago

It would be very hard to change technically, yes.

But that's not the only solve. It's easy to change the words we use instead to make it clear to users that the data isn't irrevocably deleted.

Aeolun 378 days ago

Or maybe it should be illegal to have a court order that the privacy of millions of people should be infringed? I’m with OpenAI on this one, regardless of their less than pure reasons. You don’t get to wiretap all of the US population, and that’s essentially what they are doing here.

amanaplanacanal 377 days ago

They are preserving evidence in a lawsuit. If you are concerned, you can try petitioning the court to keep your data private. I don't know how that would go.

djrj477dhsnv 377 days ago

The privacy of millions of people should take precedence over ease of evidence collection for a lawsuit.

Aeolun 377 days ago

You can use that same argument for wiretapping the US, because surely someone did something wrong. So we should just collect evidence on everyone on the off chance we need it.

baobun 377 days ago

That's already the case. Ever looked into the Snowden leaks?

JKCalhoun 378 days ago

"Marked" for deletion.

jandrewrogers 378 days ago

The concept of “deleted” is not black and white, it is a continuum (though I agree that this is a very soft delete). As a technical matter, it is surprisingly difficult and expensive to unrecoverably delete something with high assurance. Most deletes in real systems are much softer than people assume because it dramatically improves performance, scalability, and cost.

There have been many attempts to build e.g. databases that support deterministic hard deletes. Unfortunately, that feature is sufficiently ruinous to efficient software architecture that performance is extremely poor such that no one uses them.
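
A small experiment along these lines, as a sketch: even a real SQL `DELETE` may leave the bytes in the database file. The code below uses SQLite's documented `PRAGMA secure_delete` to compare the two behaviours; the table name and marker string are made up, and the exact result can depend on how the local SQLite library was built.

```python
import os
import sqlite3
import tempfile

MARKER = "please-forget-this-sentence"


def marker_survives_delete(secure_delete):
    """Insert a row, DELETE it, and report whether its text is still in the file.

    Returns (rows_left_in_table, marker_found_in_raw_file).
    """
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        conn = sqlite3.connect(path)
        conn.execute(f"PRAGMA secure_delete = {'ON' if secure_delete else 'OFF'}")
        conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
        conn.executemany(
            "INSERT INTO notes (body) VALUES (?)",
            [("keep me",), (MARKER,), ("keep me too",)],
        )
        conn.commit()
        conn.execute("DELETE FROM notes WHERE body = ?", (MARKER,))
        conn.commit()
        rows = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
        conn.close()
        # Look at the raw file, the way someone holding a copy of it could.
        with open(path, "rb") as f:
            raw = f.read()
        return rows, MARKER.encode() in raw
    finally:
        os.remove(path)
```

With `secure_delete` off, the row is gone as far as SQL is concerned, yet the deleted text can typically still be found in the file. Turning the pragma on makes SQLite overwrite freed content with zeros, at the cost of extra writes.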
