Hacker News new | ask | show | jobs
by rm999 1860 days ago
>Soft deletes

This section is totally wrong IMO. What is the alternative? "Hard" deleting records from a table is usually a bad idea (unless it is for legal reasons), especially if that table's primary key is a foreign key in another table - imagine deleting a user and then having no idea who made an order. Setting a deleted/inactive flag is by far the least of two evils.

>when multiplied across all the analytics queries that you’ll run, this exclusion quickly starts to become a serious drag

I disagree, modern analytics databases filter cheaply and easily. I have scaled data orgs 10-50x and never seen this become an issue. And if this is really an issue, you can remove these records in a transform layer before it hits your analytics team, e.g. in your data warehouse.

>soft deletes introduce yet another place where different users can make different assumptions

Again, you can transform these records out.

4 comments

Most order forms are snapshots of data at the instant of their lodgement, since they are sales contracts. It is a rookie blunder to link them relationally to master data for products and PII &c.

The record of an order is not intrinsically PII and thereby subject to rights of erasure. It may well be equally unlawful in some jurisdictions to irrevocably destroy it entire, it being necessary for accounting or tax audit, or even simply for mundane followup process, such as returns, that arise from actionable consumer rights. Ergo, such documents must fundamentally survive the erasure/redaction of any PII it does include.

> It is a rookie blunder to link them relationally to master data for products and PII &c.

Is it always? If that data is immutable, for example?

How are you going to satisfy data compliance, which may require the deletion of PII upon request or expiration, if your PII data is immutable?
Was just considering this when I came across your comment.

I'm hoping someone here can suggest a one-way audit-log audit-trail sort of solution, because I need this for the medical industry.

I would say the structure of the records themselves can stay, but not the data itself.

If you have a user table, maybe you can just randomly hash the sensitive user data but keep the record.

Not 100% sure about this though, since you could probably derive the user with statistics like: if it's known that 1 person gets a specific disease every 10 years and you have an obfuscated record of a person connected with that disease, it's fairly straightforward to derive who that person is just through that connection.

The way to do it is to have foreign keys, but instead of hard delete you scrub data in the columns.
Seems like if you can scrub data, it is mutable? Maybe I misunderstand what “immutable” means.
How are you going to respond to a warranty claim if you've wrongly deleted the order data when a subject requests that you delete all PII you have about them?

https://www.adyen.com/blog/gdpr-what-it-means-for-customer-p...

You store the sales contract separately, which is what this whole subthread is about.

https://news.ycombinator.com/item?id=27251168

I'm not sure how this changes anything. Is your PII in the order forms/sales contracts mentioned above? If yes, you'll have to delete those as well anyway, right? If it's not, the order forms/sales contracts themselves don't have to be linked to something that may potentially get deleted.

Please note that by "immutable" I don't mean that data won't get deleted eventually, just that it won't be deleted until nothing needs it anymore (and until then won't be mutated either), so basically the same thing that languages like Haskell mean by "immutable". Then, once you don't need it (= it's not observable anymore), it could perhaps get archived or erased, whatever you prefer.

Then it isn’t master data anymore; it’s just one field of a record of a commercial document. This is taking a long way around to the same point.

(Presumably no-one is trying to reference-count GC their RDBMS. If so, I wish them all the luck in the world.)

I'm not quite sure what "master data" means in English (a non-native language to me) but Wikipedia tells me that it's "data about the business entities that provide context for business transactions" (and lists examples that sound relevant for this situation to me). Based on that I'm inclined to think that this would qualify.
Hard deletes are also awful from the perspective of data preservation. For example, when youtube removes a video they also delete all the metadata or any indication that it ever existed. Countless people have lost what they thought was a secure record of at least the title of songs or videos they saved to a playlist.

There is also a more sinister side, which is that the ability to hard delete something forever means that bad actors can fabricate old "deleted" documents and accuse someone of having created and then deleted them.

I do think that hard deletes may sometimes be required to comply with legal requirements (e.g. complete expungement of personal information relating to a user). If it is not required by statutory law, sometimes it is written into commercial contracts.
Exactly. I get OP's point (i.e. you can accidentally include softdeleted records in your results), but for some types of data hard deletes are an absolute no-go anyways, so you just have to live with it.
> if that table's primary key is a foreign key in another table - imagine deleting a user and then having no idea who made an order

Assuming you have constraints set up correctly (on delete no action or on delete restrict) then how could this ever happen? If you don’t have constraints set up correctly…

Assuming you're deleting the row because it shouldn't be used by read queries, constraints like you described prevent the problem of having orphaned records in the child table but also prevent you from achieving your goal. On delete cascade would allow you to achieve your goal and prevent the orphaned records but could lead to deleting more than intended (especially if the child table is also a parent table referenced by further foreign key constraints, its children could in turn have children, etc). Of course, with no action/restrict you could also manually cascade the delete, but if you actually don't want to delete a child row and there's not an appropriate alternative value for its foreign key then you're in a bit of a pickle.

So if you want to delete a user but keep the records of their orders and still know who made those orders, then some form of soft delete is probably your best option. I believe that's the point rm999 was making (in response to the article asserting that soft deletes are a "data model mistake"). Properly configured constraints can prevent an "oops" but don't really do anything to solve the problem of this sort of delete from some contexts but not others.

The chance that you don't have constraints set up correctly is indistinguishable from 100%.
Foreign key constraints are a waste of effort for most at-scale web apps. They guard against a subset of problems that aren’t actually problems (orphaned rows) at a putative cost (for the db)
I disagree. Any fairly competent DBA will know how to setup the constraints correctly. It's not rocket science. If you can think logically enough to program, you can think logically enough to set up constraints correctly.
This should be the responsibility of the application developer creating the database schema and queries. A constraint is part of your application logic, not of the administration of the database.
I've yet to see anything I'd consider calling a startup having a DBA. I'm positively impressed if they even default to use foreign keys.
Not everybody has a DBA :(
Well, my team has never had an official DBA either. But we do all the tasks of a DBA. I've installed/upgraded, configured, tested backups, and hardened Oracle and Postgres more times than I can remember. We do all our own DML and DDL work. It really isn't hard to do it right.
That would just make the data loss problem worse still. I realise OP just chose an arbitrary example, but if you really are talking about users and orders, and if you delete a user, then really deleting the records for their associated orders is even worse than losing track of who made them.
I realize you might not be familiar with how database constraints work, but an on delete no action would totally prevent the user from being deleted. You literally cannot delete the user and their associated orders will definitely not be deleted.
GP did not suggest ON DELETE CASCADE.
Hard deletes most likely need to be supported, due to legal or contractual obligations. Designing with this in mind, makes everything a lot easier in the long run.
I’ve always NULL’d values, not deleted rows. E.g. GDPR request? NULL out all identifying information, but keep the record.

As long as your primary key has no business meaning you should never have to delete the row of a table.

INAL, but... you might want to revisit that code. article 17, right to erasure is about erasure of personal data, not about making non-indentifiable. of course they dont define erase or delete :-)

(edit: typo)

If you erase all identifying parts it stops meeting the definition of personal data. That should be sufficient.
well to me the transaction is the same as deleting a record and populating a NULL record.

I don't see why the law should care in any way about a company populating NULL records.

> I don't see why the law should care in any way about a company populating NULL records.

It cares if the existence of this record still leaks private data. This is why talking about generic "records" here is pretty wrong - actual data is not interchangeable "records" where you can just slap on a generic cargo cult policy and think you're done.

Different use-cases require different data handling. Although, I do agree, for most CRUD cases it's enough to NULL out rows.

"NULL out all identifying information" is anonymization, not deleting the information.
I wrote zeroes to my hard drive. Would you consider the data on my hard drive merely anonymized, or is it deleted?