Hacker News
by Sir_Substance 3119 days ago
>The right to be forgotten, becomes one of the hardest challenges because of data immutability. Apache Kafka does not support deleting records, and although some eventual deletion is supported, it requires

This always seemed like an incredibly toxic decision to me. It's one that crops up in all sorts of systems, large and small. What, none of these people /ever/ foresaw the need to delete some data?

4 comments

As with everything in life, to gain something you need to sacrifice something else. With an RDBMS you get mutability, but to go 10x or 100x faster or larger you need to make hard decisions.

HDFS, S3 and other systems have immutability built in. Immutability is not bad per se, as it gives (some) assurance that data has not been tampered with. Mutability could be implemented, but the system cost could be significant.

Striking the right balance is the challenge.

It's not that simple. For example, in my business we may give some money to help someone "once in their life" (the law says so). Therefore, if the person asks to be deleted, we might no longer be able to apply the law, because we won't remember the decision... I think GDPR is a good thing, but at some point, in my business, those who write the laws will have to be aware of this (and the legal team is miles away from the IT stuff, sadly).
The GDPR offers exceptions to the right to erasure. These mostly cover legal compliance (banks), the interest of legal claims, or cases where data cannot be easily deleted as individual records. It also does not affect any non-digital documents which aren't filed. This is all laid out very thoroughly in the legal documents relating to this.
I must admit I didn't read the section about removal thoroughly. But I did read the articles about the "categories of data", which are the major pain point right now 'cos they force you to, well, find appropriate categories of data. It's a very interesting thing to do but, in my organization, it leads to many loooong discussions :-)
GDPR has an exemption related to the legal requirement to process data that might cover this (and related) scenarios.

> ...(unless) processing is necessary for compliance with a legal obligation to which the controller is subject;

Does this mean that someone can game 1-time special offers by repeatedly signing up and then demanding to be forgotten?

There's probably no legal obligation to enforce once-only cashback sign-up offers, so the right to be forgotten would presumably have to be followed.

There is an exception category for “legitimate business interest” so we’ll probably have to wait and see what the courts have to say.
Where integrity matters, you never want data to be mutated with no trace. An audit trail is almost always needed - it’s not an extreme leap, then, to say “why don’t we just replay the audit trail to arrive at the current state?”
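To make the "replay the audit trail" idea concrete, here is a minimal Python sketch. The `Event` type, the account names and the deposit/withdraw event kinds are all made up for illustration; nothing here comes from a specific system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One immutable entry in a hypothetical audit trail."""
    account: str
    kind: str    # "deposit" or "withdraw" (illustrative event kinds)
    amount: int


def replay(events):
    """Fold the append-only trail into current balances without mutating it."""
    balances = {}
    for event in events:
        if event.kind == "deposit":
            delta = event.amount
        elif event.kind == "withdraw":
            delta = -event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
        balances[event.account] = balances.get(event.account, 0) + delta
    return balances


audit_trail = [
    Event("alice", "deposit", 100),
    Event("bob", "deposit", 50),
    Event("alice", "withdraw", 30),
]

print(replay(audit_trail))  # {'alice': 70, 'bob': 50}
```

The current state is only a derived view here, which is exactly why deleting one person's history is awkward: the trail itself is the source of truth.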
>What, none of these people /ever/ foresaw the need to delete some data?

It's a performance trade-off, and not a very surprising one. Hard disk drives have always been known to never actually delete data (if you want the data gone, you overwrite it with 0s). It's not unimaginable that this performance trade-off found its way up the stack.

And just like a regular HDD, you can forcibly delete the data, it's just a very expensive operation that isn't needed 95% of the time.

>And just like a regular HDD, you can forcibly delete the data

Except apparently not, because the linked article is literally saying it's not supported.

I get wanting an audit trail, and I get wanting to not delete data if you don't have to for performance reasons, but neither of those things is the same as saying "it's literally not possible to delete stuff".

> Except apparently not, because the linked article is literally saying it's not supported.

Not entirely true. Kafka, out of the box (and as far as I know, I'm no expert) will keep the records for 7 days and delete them afterwards.

Most people I know (including myself) use Kafka to keep records for longer, and a good option is to use what the article describes, which is to compact the logs. In that case the log gets compacted after a configured period of time (or when it reaches a configured size): only the latest message for a given id is kept, and all previous messages with the same id get deleted (that's why the process needs a message with a null value to perform the "delete").
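A toy model of that compaction behaviour in plain Python, just to show the mechanics. This is not the Kafka API: the log is a hypothetical list of (key, value) pairs and `None` stands in for the null-value "delete" message. One simplification to be aware of: real Kafka keeps the null-value message around for a configurable time so consumers can still see the delete, whereas this sketch drops it immediately.

```python
def compact(log):
    """Keep only the latest record per key; drop keys whose latest value is None.

    `log` is a list of (key, value) pairs in append order.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones

    # Preserve the original ordering of the surviving records.
    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )
    return [(key, value) for _offset, key, value in survivors]


log = [
    ("user-1", "alice@example.com"),
    ("user-2", "bob@example.com"),
    ("user-1", "alice@new.example.com"),  # newer state for user-1
    ("user-2", None),                     # null value: "delete" user-2
]

print(compact(log))  # [('user-1', 'alice@new.example.com')]
```

Both the old user-1 record and everything about user-2 are gone after compaction, which is the property that matters for erasure requests.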

Only when you want to keep the data forever and can't use compaction (compaction assumes that your messages always contain the full state of an entity, so the last message will always contain the current state and the previous ones can be deleted with no side effects) is there no way to delete a specific message. I'd have to read the exceptions for backups included in GDPR, but you could make the case that, in this situation, the Kafka log is maintained only as a backup of the data, to be able to replay it again in case something downstream gets broken.

>Except apparently not, because the linked article is literally saying it's not supported.

I see your point, but you are mistaken (or you took the wrong impression from the article): it is supported, you just wouldn't want to do it day-to-day, and for the context of the article it might as well not exist. In Kafka, if you want to forcibly delete the data, you could simply force topic compaction after a delete. Depending on the size of your data, a regular delete could take hours, which would likely blow the resource usage on any decently sized deployment.

I bring this up because a lot of shiny "BigData" databases use log-structured merge trees, which are immutable and where deletes are mostly "soft deletes" until a "compaction".
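For anyone who hasn't looked inside one of these, a very rough Python sketch of the soft-delete-until-compaction behaviour. `TinyLSM` and all of its methods are invented for illustration and leave out almost everything a real LSM tree does (sorted files, levels, bloom filters).

```python
TOMBSTONE = object()  # marker meaning "this key was deleted"


class TinyLSM:
    """Toy log-structured store: a mutable memtable plus immutable segments."""

    def __init__(self):
        self.memtable = {}
        self.segments = []  # flushed segments, oldest first, never edited in place

    def put(self, key, value):
        self.memtable[key] = value

    def delete(self, key):
        self.memtable[key] = TOMBSTONE  # soft delete: old value stays on "disk"

    def flush(self):
        if self.memtable:
            self.segments.append(dict(self.memtable))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: memtable first, then segments from newest to oldest.
        for table in [self.memtable, *reversed(self.segments)]:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        """Merge all segments into one, physically dropping deleted keys."""
        merged = {}
        for segment in self.segments:
            merged.update(segment)
        self.segments = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]

    def physically_stored(self, key):
        """Real values for `key` still sitting in flushed segments."""
        return [s[key] for s in self.segments if key in s and s[key] is not TOMBSTONE]


db = TinyLSM()
db.put("user-1", "alice@example.com")
db.flush()
db.delete("user-1")
db.flush()

print(db.get("user-1"))                # None: logically deleted
print(db.physically_stored("user-1"))  # ['alice@example.com']: still on "disk"
db.compact()
print(db.physically_stored("user-1"))  # []: gone only after compaction
```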

What is your thought on this?

> Encrypt with a user specific key when the data enters the log. You can effectively delete all the user specific data by throwing the key away. No tracking down files or reprocessing necessary.

from https://news.ycombinator.com/item?id=15847674

From a technical perspective, that seems like a sensible compromise to me.
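A minimal sketch of that "throw the key away" approach in Python, with big caveats. `ShreddableLog`, its methods and the record layout are hypothetical. To stay dependency-free it uses a toy HMAC-SHA256 keystream as the cipher, which is there for illustration only: a real system would use a vetted authenticated cipher from a proper crypto library, and would also have to make sure the keys themselves are not sitting in backups.

```python
import hashlib
import hmac
import secrets


def _keystream(key, nonce, length):
    """Toy keystream: HMAC-SHA256 over nonce + counter. Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])


def _xor(data, stream):
    return bytes(a ^ b for a, b in zip(data, stream))


class ShreddableLog:
    """Append-only log of ciphertexts plus a small, mutable per-user key store."""

    def __init__(self):
        self.keys = {}     # user_id -> (key_id, key); kept outside the log
        self.records = []  # append-only: (key_id, nonce, ciphertext)

    def append(self, user_id, plaintext):
        if user_id not in self.keys:
            self.keys[user_id] = (secrets.token_hex(8), secrets.token_bytes(32))
        key_id, key = self.keys[user_id]
        nonce = secrets.token_bytes(16)  # fresh nonce per record
        ciphertext = _xor(plaintext, _keystream(key, nonce, len(plaintext)))
        self.records.append((key_id, nonce, ciphertext))

    def read(self, user_id):
        if user_id not in self.keys:
            return []  # key was shredded (or never existed): nothing is readable
        key_id, key = self.keys[user_id]
        return [
            _xor(ct, _keystream(key, nonce, len(ct)))
            for kid, nonce, ct in self.records
            if kid == key_id
        ]

    def forget(self, user_id):
        """'Delete' a user's data by discarding the key; the log is untouched."""
        self.keys.pop(user_id, None)


log = ShreddableLog()
log.append("u1", b"alice@example.com")
log.append("u2", b"bob@example.com")

print(log.read("u1"))    # [b'alice@example.com']
log.forget("u1")
print(log.read("u1"))    # []
print(len(log.records))  # 2
```

The immutable log never changes; the only mutable piece is the small key store, which is a much easier place to support deletes.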