	by CydeWeys 1589 days ago
	The data protection regulations really are so freeing, huh. It's amazing to be able to delete all this stuff without worrying about having to keep it forever.

3 comments

jeff_vader 1589 days ago

In the case of my previous employer, it led to an incredibly complicated encryption system. It took a couple of years to implement it in maybe 10% of the system. Deleting any old data was rejected.

hinkley 1589 days ago

I wonder sometimes if it would help if we collectively watched more anti-hoarding shows, in order to see how the consultants convince their customers they can get rid of stuff.

mro_name 1589 days ago

humans spent their first 300k years as nomads – storing was just impossible and decrufting happened by itself when moving along.

So maybe that's why we're not good at it yet.

hinkley 1589 days ago

Being a renter definitely kept me lighter for a long time.

When you have to box things up over and over you find that the physical and mental energy around keeping it aren’t adding up. I wonder if migrating from cloud to cloud would simulate this experience.

Bayart 1589 days ago

Being a renter just taught me to batch my $STUFF I/O to minimize read-writes to disk and maximize available low-latency space, i.e. fill my bags to the brim with shit I didn't plan on using whenever I'd go to my parents'.

travisgriggs 1589 days ago

Two space garbage collector in action right there. Maybe all things software need a "move it or lose it" impetus. Features in apps, old data, you name it. If you've gotta keep transferring/translating it, it would definitely pare things down.
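
For readers who haven't met the term, here is a rough sketch of what a two-space (semispace) copying collector does: everything still reachable gets moved to a fresh space, and whatever nobody bothered to move is dropped. This is a toy model, not a real collector: the heap is assumed to be a plain dict of object id to referenced ids, and the name `collect` is made up for illustration. A real implementation (e.g. Cheney's algorithm) works on raw memory with forwarding pointers.

```python
def collect(from_space, roots):
    """Copy everything reachable from `roots` into a fresh to-space.

    from_space: dict mapping object id -> list of referenced object ids
                (a toy heap model, assumed for this sketch).
    roots:      iterable of object ids that are still in use.
    Anything not copied is garbage and is discarded with the old space.
    """
    to_space = {}
    worklist = list(roots)
    while worklist:
        obj = worklist.pop()
        if obj in to_space or obj not in from_space:
            continue  # already moved, or a dangling reference
        to_space[obj] = list(from_space[obj])  # "move it"
        worklist.extend(from_space[obj])       # then visit what it points to
    return to_space                            # the rest is "lose it"


heap = {"a": ["b"], "b": ["a"], "c": ["a"]}
survivors = collect(heap, roots=["a"])
```

Here `c` points at live data but nothing points at `c`, so it does not survive the move.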

mro_name 1588 days ago

maybe reuse is inferior to re-implementing. Moderately re-inventing wheels may be beneficial. What would the threshold be?

fomine3 1588 days ago

Also, hoarding digital data is far easier than hoarding real stuff. I wish I could grep real space.

stingraycharles 1589 days ago

How is encryption compliant? I’ve implemented GDPR data infrastructures twice now, and as far as I’m aware, the only way to be compliant with encryption is when you throw the decryption key away.

aeyes 1589 days ago

Sometimes it might be a single field in a 1MB nested structure that you have to remove. So it gets encrypted when the whole structure gets stored and when the field is to be deleted you just throw away the key instead of modifying the entire 1MB just to remove a few kB.
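
A minimal sketch of that idea, under loud assumptions: one field is encrypted with a per-user key before the whole structure is stored, and "deleting" the field means dropping the key. `KeyStore`, `store_record` and `read_field` are hypothetical names, and the cipher is a toy HMAC-SHA256 keystream used only so the example runs on the standard library. A real system would use a vetted AEAD such as AES-GCM from a proper crypto library.

```python
import hashlib
import hmac
import secrets


def _keystream(key, nonce, length):
    """Toy keystream (HMAC-SHA256 in counter mode). NOT production crypto."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = nonce + counter.to_bytes(8, "big")
        out += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])


def xor_crypt(key, nonce, data):
    """Encrypt or decrypt: XOR with the keystream is its own inverse."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, nonce, len(data))))


class KeyStore:
    """Hypothetical per-user key store: the only thing a delete has to touch."""

    def __init__(self):
        self._keys = {}

    def key_for(self, user_id):
        if user_id not in self._keys:
            self._keys[user_id] = secrets.token_bytes(32)
        return self._keys[user_id]

    def get(self, user_id):
        return self._keys.get(user_id)

    def shred(self, user_id):
        self._keys.pop(user_id, None)


def store_record(keys, user_id, record, field):
    """Return a copy of `record` with one string field replaced by ciphertext."""
    nonce = secrets.token_bytes(16)
    sealed = dict(record)
    ciphertext = xor_crypt(keys.key_for(user_id), nonce, record[field].encode())
    sealed[field] = {"nonce": nonce, "ct": ciphertext}
    return sealed


def read_field(keys, user_id, sealed, field):
    """Decrypt the field, or return None if the key has been thrown away."""
    key = keys.get(user_id)
    if key is None:
        return None
    blob = sealed[field]
    return xor_crypt(key, blob["nonce"], blob["ct"]).decode()
```

After `keys.shred(user_id)` the stored structure is byte-for-byte unchanged, but `read_field` can no longer recover the value.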

dylan604 1589 days ago

If you're comparing gov't regulations to delete data to saving a few KB, then I think you're looking at this wrong.

viraptor 1589 days ago

It's a few KB per record. In practice when schemes like that are applied, it means "in total we can remove this key and not rewrite 10M rows across 3 data stores which itself would cost $$$ and make the database and incremental backups cry".

ByteJockey 1589 days ago

Bingo.

We did a similar thing except replacing the values with a UUID and storing the pair in a lookup table somewhere. Delete that row and none of the rest of the data is able to be tied back to a human being.
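
Roughly what that lookup-table approach could look like, as a sketch only: `PseudonymTable` and its method names are invented here, and an in-memory dict stands in for whatever table the original system used.

```python
import uuid


class PseudonymTable:
    """Hypothetical lookup table mapping identifying values to random UUIDs."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        """Return the UUID stored in place of `value` across the dataset."""
        token = self._token_by_value.get(value)
        if token is None:
            token = str(uuid.uuid4())
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return token

    def resolve(self, token):
        """Map a UUID back to the person, if the row still exists."""
        return self._value_by_token.get(token)

    def forget(self, value):
        """Delete the one row that ties the UUID to a human being."""
        token = self._token_by_value.pop(value, None)
        if token is not None:
            del self._value_by_token[token]


table = PseudonymTable()
row = {"user": table.tokenize("jane@example.com"), "purchases": 3}
table.forget("jane@example.com")
# row["user"] is still a valid UUID, but resolve() can no longer map it back.
```

The bulk dataset only ever carries the UUID, so handing out the whole dataset no longer hands out the identifying values.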

Bonus, most people didn't need that data, and it was no longer given out to everyone who grabbed the entire dataset.

spelunker 1589 days ago

As mentioned, encrypt something and throw away the key, often called "crypto shredding".

stingraycharles 1589 days ago

Ahh I see, and that way you can quickly “remove” a whole lot of data by just removing the key, which makes for cheap operations and/or a more flexible workflow (you can periodically compact the database and remove entries for which you have no key).

Is my understanding correct?
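
The periodic compaction step mentioned here might be sketched like this, assuming each stored row carries a hypothetical `key_id` field naming the key it was encrypted with:

```python
def compact(rows, live_key_ids):
    """Rewrite a segment, keeping only rows whose encryption key still exists.

    rows:         list of dicts, each with a "key_id" field (assumed layout).
    live_key_ids: set of key ids that have not been shredded.
    Returns (kept_rows, dropped_count).
    """
    kept = [row for row in rows if row["key_id"] in live_key_ids]
    return kept, len(rows) - len(kept)


segment = [
    {"key_id": "k1", "ct": b"..."},
    {"key_id": "k2", "ct": b"..."},
    {"key_id": "k1", "ct": b"..."},
]
kept, dropped = compact(segment, live_key_ids={"k1"})
```

The point of the scheme is that this rewrite can run whenever it is cheap to do so, since the data became unreadable the moment the key was removed.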

dalyons 1589 days ago

yes, but it's also that a lot of the data these days ends up in pseudo-append-only stores (like S3/Glacier, or many data warehouse products) where deletes/updates to old data are extremely expensive. Or you'd have to scan petabytes of cold-stored data looking for a particular user's records. Throwing away the key is instant and "free".

chrisjc 1588 days ago

Interesting... this raises soooo many questions.

How are "crypto-shredding" actions propagated to the access patterns/layer?

I assume that there is an encrypted partition/cluster/shard key (in addition to similarly encrypted rows/fields) that is invalidated during the shredding, causing any predicate matching on these ids to evaluate to false.

---

Now that I've typed this out, I realize that by electing to encrypt individual fields, any and all predicate matching will evaluate to false, and that has nothing to do with partitioning, sharding, or clustering...

I guess it would also be pretty awesome since you could invalidate entire sets of data by "shredding" grouping ids that are being used as partition/cluster/shard keys.

Now I realize that this implies that you shouldn't encrypt each and every field of related data the same way (grouping ids), otherwise you're potentially going to end up with unique keys/ids for common attributes across sets of data... potentially rendering clustering/sharding/partitioning useless (cardinality too great).

While "defragging" or "rebalancing" this increasingly "sparse", old data would be expensive, surely there has to come a point where the storage costs start to exceed that of interaction costs for specific subsets of your prefixes. For instance, partitions that consist entirely of data that has had all of its respective encryption keys shredded.

---

Illuminating comment that has set my mind into overdrive... Fascinating stuff!

jhgb 1588 days ago

That doesn't sound like something jeff_vader was talking about, since "deleting any old data was rejected" and this is definitely a way of deleting stuff.

theshrike79 1589 days ago

Yep, having everything disappear at 2 months max is a life-saver.

That "absolutely essential thing" isn't essential any more when there is a possible GDPR/CCPA violation with a significant fine just around the corner.

koolba 1589 days ago

Just make sure you actually test your backups. Two months of unusable backups are just as useful as no backups.

marcosdumay 1589 days ago

Well, you should have done this before GDPR too, but reminding people to test backups is never too late and never too often.

whimsicalism 1589 days ago

now this is a spin i havent heard before.

jabroni_salad 1589 days ago

As a sysadmin I really wish you had. SO MANY problems have come to my desk because some dude 3 years ago did not consider retention or rotation and now I have to figure out what to do with a 4TB .txt that is apparently important.

briffle 1589 days ago

"You never know when you might need this info to debug" The developer says as their cronjob creates a 250MB csv file, and a few MB of debug logs per day, for the past few years. "Disk is cheap" they say.

As a sysadmin, I hate that too.
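
For the cronjob case, one way to get rotation and retention without anyone having to remember it is to let the logging handler do it. A small sketch using Python's standard `logging.handlers.TimedRotatingFileHandler`; the function name `build_handler`, the file name, and the 60-day window are arbitrary choices for illustration:

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def build_handler(path, keep_days=60):
    """File handler that rotates at midnight and keeps at most `keep_days` old files."""
    handler = TimedRotatingFileHandler(
        path,
        when="midnight",        # start a new file once per day
        backupCount=keep_days,  # older rotated files are deleted automatically
        encoding="utf-8",
        delay=True,             # do not create the file until the first record
    )
    handler.setLevel(logging.INFO)  # debug/trace output never reaches the disk
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    return handler


logger = logging.getLogger("cronjob")
logger.setLevel(logging.INFO)
logger.addHandler(build_handler("cronjob.log", keep_days=60))
```

This only covers logs written through the handler; the daily 250MB csv would still need its own cleanup job.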

whimsicalism 1589 days ago

sometimes the data is just big...

colechristensen 1589 days ago

Often a considerable portion of those logs are useless, trace level misclassified as info, kept for years for no reason.

You should keep a minimal set of logs necessary for audit, logs for errors which are actually errors, and logs for things which happen unexpectedly.

What people do keep are logs for everything which happens, almost all of which is never a surprise.

One needs to go through logs periodically and purge the logging code for every kind of message which doesn’t spark joy, I mean seem like it would ever be useful to know.

whimsicalism 1589 days ago

sure, in a world where machine learning doesnt exist i would agree with you. for low level logs of things like "memory low, spawning a new container" i would also agree with you. not for user actions though (which is the topic closest to whats under discussion given what sort of data these regulations cover)

dylan604 1589 days ago

Find out how important it is with a `mv 4TB.txt 4TB.old` type of thing. See how many people come screaming.

chrisjc 1588 days ago

Have you come up with a process, or an idea for a process to ensure this doesn't happen?

For instance, when they create a provisioning request, are you able to set an extremely low threshold? When they say that won't do, the cost increases and they're able to see/understand and start to care about the actual lifecycles of what they're creating?

Surely there is a way to project and monitor the cost of their resources over time, and deliver them an invoice on a regular basis? In other words, something like a cost attribution model? That way, when the bills start to increase dramatically over time, pinpointing the heavy hitters becomes trivial, and when they come knocking on your door to "do something about it" you can just say "go talk to Bob".

I don't mean to sound like I'm trivializing the problem (honestly I can relate as I've gone through it myself), but I'd love to hear how anyone else has dealt with this issue effectively.
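
To make the cost attribution idea concrete, a bare-bones sketch: given a listing of who owns which files, total the bytes per owner and price them. Everything here is assumed for illustration, including the `monthly_bill` name, the `(owner, path, size_bytes)` input shape, and the flat per-GB price.

```python
from collections import defaultdict


def monthly_bill(usage, price_per_gb_month):
    """Attribute storage cost to owners, heaviest hitters first.

    usage: iterable of (owner, path, size_bytes) tuples (assumed input shape).
    Returns a list of (owner, cost) pairs sorted by cost, descending.
    """
    bytes_by_owner = defaultdict(int)
    for owner, _path, size_bytes in usage:
        bytes_by_owner[owner] += size_bytes
    bill = {
        owner: round(total / 1e9 * price_per_gb_month, 2)
        for owner, total in bytes_by_owner.items()
    }
    return sorted(bill.items(), key=lambda item: item[1], reverse=True)


usage = [
    ("bob", "/data/4TB.txt", 4_000_000_000_000),
    ("alice", "/exports/report.csv", 500_000_000_000),
]
invoice = monthly_bill(usage, price_per_gb_month=0.02)
# invoice[0] is the owner to "go talk to".
```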

jabroni_salad 1588 days ago

It comes down to monitoring, alerting, and followup. In other words, "good ops", which is lacking almost everywhere. Unfortunately that is always a moving target, with added complexity being that we're an external service provider and have limited authority in the client environment. Also, the sorts of companies that outsource their ops will also be willing to change providers multiple times, so it's often like trying to live in a library that has seen many generations of librarians each with their own ideas for how things ought to be organized.

hvs 1589 days ago

You haven't heard it because it's not spin, it's from an engineer's point of view. That's not the view you hear in the news when it comes to these things.

whimsicalism 1589 days ago

HN seems like an odd place to assume that people only hear about things from the news and aren't engineers themselves.

i am a dev that has to deal with these regulations in my day to day. it is a pain, it is not freeing in any sense, and it makes my models worse.

granted, i think there are good reasons for it, but it does not make my life easier for sure.

alisonkisk 1589 days ago

Eh, retention and deletion are both a pain for devs. Not having to care is the happy state.
