Hacker News
by mdjasper 2590 days ago
There are a million reasons to agree that the data is never actually deleted:

* need to retain data to fulfill government requests

* internal auditing

* it's all backed up in some "data lake" somewhere to do internal ml or analytics on

* hundreds of copies in database backups from different times

* internal logs that contain the data

* it's already been analyzed and aggregated into learning products and models that aren't going to be recomputed

It's not being "paranoid". As someone who has worked on large-scale SaaS, I say: there is zero, 0, ZERO, 0.00 chance of that data ever actually being deleted.

4 comments

> As someone who has worked on large-scale SaaS, I say: there is zero, 0, ZERO, 0.00 chance of that data ever actually being deleted

Unless that is built in by design. I also happen to work on a large-scale SaaS where we take this stuff very seriously, and I can say it is possible to protect this data. I will agree that this adds considerable complexity, but for some organizations that is totally worth it.

> need to retain data to fulfill government requests

That's a choice, not a requirement. If you encrypt the data and purposely don't store the keys yourself but instead have the customer store them, then you don't have anything of value for the government.
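To make the customer-held-key idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `Service` class and the record ID are hypothetical, and the cipher is a toy keystream built from the standard library's HMAC-SHA256 so the example runs without dependencies. A real system would use an audited authenticated cipher instead.

```python
import hashlib
import hmac
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream from HMAC-SHA256(key, nonce || counter). Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256)
        out += block.digest()
        counter += 1
    return bytes(out[:length])


def encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Return nonce || ciphertext. Runs on the customer's side."""
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def decrypt(key: bytes, blob: bytes) -> bytes:
    """Inverse of encrypt. Also runs on the customer's side."""
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


class Service:
    """Hypothetical provider-side store: it holds ciphertext only, never keys."""

    def __init__(self):
        self.blobs = {}

    def put(self, record_id: str, blob: bytes) -> None:
        self.blobs[record_id] = blob

    def get(self, record_id: str) -> bytes:
        return self.blobs[record_id]


# The key is generated and kept by the customer; the service never sees it.
customer_key = secrets.token_bytes(32)
plaintext = b"alice@example.com"

service = Service()
service.put("profile-1", encrypt(customer_key, plaintext))

# Only the key holder can recover the record.
roundtrip = decrypt(customer_key, service.get("profile-1"))
```

If the customer discards the key, what the provider still holds is unreadable, which is why this pattern leaves nothing of value to hand over.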

> internal auditing

Personally Identifiable Info is not something we want to peruse. In fact we purposely don't want to see it because that eliminates a potential for mishandling.

> it's all backed up in some "data lake" somewhere to do internal ml or analytics on

That kind of application shouldn't give carte blanche to disregard retention policies. You can run those applications against replicated shards of the original data; and when the original gets reclaimed, so does the replica.
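A rough sketch of that "reclaim the original, reclaim the replica" rule follows. The `Primary` and `Replica` classes are hypothetical in-memory stand-ins, not any particular database's replication API.

```python
class Replica:
    """Hypothetical analytics-side copy of the data."""

    def __init__(self):
        self.rows = {}


class Primary:
    """Hypothetical source of truth that propagates writes and reclaims."""

    def __init__(self):
        self.rows = {}
        self.replicas = []

    def attach(self, replica: Replica) -> None:
        # A new replica starts from a copy of the current data.
        replica.rows.update(self.rows)
        self.replicas.append(replica)

    def put(self, key, value) -> None:
        self.rows[key] = value
        for replica in self.replicas:
            replica.rows[key] = value

    def reclaim(self, key) -> None:
        # Deleting from the primary also deletes from every replica.
        self.rows.pop(key, None)
        for replica in self.replicas:
            replica.rows.pop(key, None)


primary = Primary()
analytics = Replica()
primary.put("user-1", {"email": "alice@example.com"})
primary.attach(analytics)
primary.put("user-2", {"email": "bob@example.com"})

primary.reclaim("user-1")
```

The design point is that analytics jobs read only from replicas that are wired into the same reclaim path, so no copy outlives its original.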

> hundreds of copies in database backups from different times

Storing useless data forever is not cheap, especially at scale. Better to store what needs to be stored and free up what can be freed when retention policies kick in (or when a user requests it).
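As an illustration, a retention sweep can be sketched in a few lines. The `Record` fields, the 365-day window, and the sample data are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Record:
    record_id: str
    created_at: datetime
    deletion_requested: bool = False  # set when the user asks for removal


def sweep(records, retention: timedelta, now: datetime):
    """Split records into (kept, reclaimed) under a fixed retention window."""
    kept, reclaimed = [], []
    for record in records:
        expired = now - record.created_at > retention
        if expired or record.deletion_requested:
            reclaimed.append(record)
        else:
            kept.append(record)
    return kept, reclaimed


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
records = [
    Record("a", now - timedelta(days=400)),                         # past retention
    Record("b", now - timedelta(days=10)),                          # still needed
    Record("c", now - timedelta(days=5), deletion_requested=True),  # user asked
]

kept, reclaimed = sweep(records, retention=timedelta(days=365), now=now)
```

The same sweep would have to run against backups too, otherwise the "hundreds of copies" objection still stands.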

> internal logs that contain the data

That's grounds for failing certain compliance audits. Logs should never contain PII in the first place; that's an operational failure.
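One common backstop is a redaction filter on the log handler. The sketch below uses Python's standard `logging.Filter`; the email regex, the `[REDACTED]` marker, and the logger name are assumptions for the example. It only catches email-shaped strings and is not a substitute for keeping PII out of log calls in the first place.

```python
import io
import logging
import re

# Deliberately simple pattern; real PII detection needs far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class RedactEmails(logging.Filter):
    """Replace email-shaped substrings in the final log message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then scrub the result.
        record.msg = EMAIL_RE.sub("[REDACTED]", record.getMessage())
        record.args = None
        return True  # keep the record, just with scrubbed content


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactEmails())

logger = logging.getLogger("pii_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output confined to our handler

logger.info("password reset requested by %s", "alice@example.com")
logged = stream.getvalue()
```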

> it's already been analyzed and aggregated into learning products and models that aren't going to be recomputed

That's a tricky one, but if those are actual models instead of giant lookup tables, one could assume the data is not reconstructible. However, that needs to be a design consideration of the models themselves, to prevent user data from persisting.

Ever is a very very long time indeed. Facebook will go out of business or get sold eventually and delete it all to save money or by accident.

See also: Myspace accidentally dumping everything pre-2015 https://www.engadget.com/2019/03/18/myspace-lost-12-years-mu...

For similar reasons, I doubt very much that they actually deleted everything they said they did. I know that they were backing up to tape and storing it at Iron Mountain years prior.
But someone has to pay that bill. If Facebook ever fails, I doubt that someone would keep data centers full of information around without getting paid.
They would sell it with their other assets.
Right, and deleting is usually expensive in large systems; it's easier to tombstone and delete later.
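For anyone unfamiliar with tombstoning, a minimal sketch: a delete only marks the row, reads treat marked rows as gone, and a later compaction pass does the physical removal. The `TombstoneStore` class, its method names, and the grace period are hypothetical.

```python
class TombstoneStore:
    """Hypothetical key-value store with soft deletes and deferred compaction."""

    def __init__(self):
        self._rows = {}
        self._tombstones = {}  # key -> time the delete was requested

    def put(self, key, value) -> None:
        self._rows[key] = value
        self._tombstones.pop(key, None)  # a rewrite clears any old marker

    def delete(self, key, now: float) -> None:
        # Cheap path: mark the row as deleted, leave the bytes in place.
        if key in self._rows:
            self._tombstones[key] = now

    def get(self, key):
        # Readers see tombstoned rows as missing.
        if key in self._tombstones:
            return None
        return self._rows.get(key)

    def compact(self, now: float, grace_seconds: float) -> int:
        """Physically remove rows whose tombstone is older than the grace period."""
        due = [k for k, t in self._tombstones.items() if now - t >= grace_seconds]
        for key in due:
            del self._rows[key]
            del self._tombstones[key]
        return len(due)

    def physically_present(self, key) -> bool:
        return key in self._rows


store = TombstoneStore()
store.put("photo-1", b"...")
store.delete("photo-1", now=100.0)

visible = store.get("photo-1")                       # hidden from readers
still_on_disk = store.physically_present("photo-1")  # but the bytes remain
removed_early = store.compact(now=150.0, grace_seconds=100.0)
removed_later = store.compact(now=200.0, grace_seconds=100.0)
```

The gap between `delete` and `compact` is exactly the window in which "deleted" data still exists, and nothing forces the compaction to ever run.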
Say someone uploads child abuse images to Facebook. Do they ever get deleted? Or do they stay in the data lake?