by FooBarBizBazz 245 days ago
Isn't this solved with salt?
4 comments

This is how I did it. You generate a salt per logging context and combine it with the base value in a SHA-2 hash. The idea is that you ruin the ability to correlate PII across separate, isolated activities. For example, if John Doe opened a new account and then added a co-owner after the fact, my team could not determine from our logs that it was the same person.
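
A minimal sketch of the per-context salting described above, assuming SHA-256 as the SHA-2 variant and simple concatenation of salt and value. The names `new_context_salt` and `pseudonymize` are hypothetical, not from any particular logging library.

```python
import hashlib
import secrets


def new_context_salt() -> bytes:
    # One fresh random salt per logging context (assumed size: 32 bytes).
    return secrets.token_bytes(32)


def pseudonymize(value: str, context_salt: bytes) -> str:
    # Salt and value are concatenated and hashed with SHA-256.
    return hashlib.sha256(context_salt + value.encode("utf-8")).hexdigest()


# Two isolated activities get two independent salts.
salt_open_account = new_context_salt()
salt_add_co_owner = new_context_salt()

token_a = pseudonymize("John Doe", salt_open_account)
token_b = pseudonymize("John Doe", salt_add_co_owner)
```

Within one context the same name always maps to the same token, so a single activity can still be traced; across contexts the tokens differ, so the logs alone do not link the two events.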

This isn't perfect, but there hasn't been a single customer (bank) that pushed back against it yet.

Salting does mostly solve the problem from an information theory standpoint. Correlation analysis is a borderline paranoia thing if you are practicing reasonable hygiene elsewhere.

If it's salted, you can't share it with a third party and determine which customers you have in common. (That's the point of the salt: to ensure that my_hash(X) != your_hash(X).)
You actually can join it when the salt provider is a dedicated shared entity. The entity rehashes the data of both organizations to use a shared salt. That is how different organizations join hashed data.
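
One way the shared-entity join could look, as a rough sketch only. It assumes each organization submits its identifiers to the trusted entity, which tokenizes them under a key that only it holds; the comment above does not say how the rehashing is actually done. `SharedTokenizer` and the sample addresses are hypothetical.

```python
import hashlib
import hmac
import secrets


class SharedTokenizer:
    """Hypothetical trusted entity that holds the shared secret."""

    def __init__(self) -> None:
        self._key = secrets.token_bytes(32)

    def tokenize(self, identifiers):
        # The same key is used for every organization, so equal inputs
        # produce equal tokens and the token sets can be joined.
        return {
            hmac.new(self._key, ident.encode("utf-8"), hashlib.sha256).hexdigest()
            for ident in identifiers
        }


entity = SharedTokenizer()
org_a_tokens = entity.tokenize(["alice@example.com", "bob@example.com"])
org_b_tokens = entity.tokenize(["bob@example.com", "carol@example.com"])

# Customers in common, identified only by token.
common = org_a_tokens & org_b_tokens
```

Neither organization learns the key, so neither can brute-force the other's tokens on its own; the trade-off is that the shared entity has to be trusted.
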
> A 2020 MacBook Air can hash every North American phone number in four hours

If you added a salt, this would still allow you to reverse any particular hashed phone number in about 4 hours; it just wouldn't allow you to do all of them at the same time.
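
The arithmetic behind that claim, as a back-of-the-envelope sketch. It assumes a ten-digit number space (10^10 candidates, an upper bound) and takes the quoted four-hour figure at face value; the function names are made up for illustration.

```python
PHONE_SPACE = 10 ** 10      # assumed upper bound: every ten-digit number
FULL_SWEEP_HOURS = 4        # figure quoted above for one full sweep


def implied_hashes_per_second(space: int = PHONE_SPACE,
                              hours: int = FULL_SWEEP_HOURS) -> float:
    # Hash rate needed to cover the whole space in the quoted time.
    return space / (hours * 3600)


def worst_case_hours(num_records: int, unique_salts: bool) -> int:
    # With a known, unique salt per record, each record needs its own sweep.
    # Without salts (or with one shared known salt), one sweep covers all.
    sweeps = num_records if unique_salts else 1
    return sweeps * FULL_SWEEP_HOURS
```

Under these assumptions the implied rate is roughly 694,000 hashes per second, and a known per-record salt only multiplies the attacker's work by the number of records.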

I do not agree. How will you reverse a salt with sufficient entropy? Imagine the salt is a 512-bit hex value, the data is a ten-digit decimal phone number, and the generated hash is 512 bits, of which the first 160 bits are used as the value. Now exactly how will you get the phone number back? Do you really think you can iterate over half of the possibilities of 512 bits in four hours?
You know the salt because it's stored alongside the hash. You're only iterating over the space of phone numbers.
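
A small sketch of that attack, assuming SHA-256 and salt-then-value concatenation (the thread does not fix either). The candidate space is shrunk to 10,000 numbers with a made-up prefix so the example runs instantly; `salted_hash`, `recover`, and `demo_candidates` are hypothetical names.

```python
import hashlib
from typing import Iterable, Optional


def salted_hash(salt: bytes, value: str) -> bytes:
    return hashlib.sha256(salt + value.encode("ascii")).digest()


def recover(target: bytes, salt: bytes, candidates: Iterable[str]) -> Optional[str]:
    # The salt is known, so only the candidate values are enumerated.
    # The salt's length adds nothing to the search cost.
    for candidate in candidates:
        if salted_hash(salt, candidate) == target:
            return candidate
    return None


def demo_candidates() -> Iterable[str]:
    # Toy space: fixed six-digit prefix, every four-digit suffix.
    return (f"555010{n:04d}" for n in range(10_000))


known_salt = bytes(64)  # 512-bit salt stored next to the hash (all zeros here)
leaked_hash = salted_hash(known_salt, "5550101234")
found = recover(leaked_hash, known_salt, demo_candidates())
```

The loop size depends only on how many phone numbers are possible, which is why a 512-bit salt does not help once it is stored with the hash.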

If it's not stored alongside the hash it's not a salt, it's something else.

https://en.wikipedia.org/wiki/Salt_(cryptography)

> If it's not stored alongside the hash it's not a salt, it's something else.

That is not even true. The definition in the article does not substantiate it. There is no requirement for the salt to be stored alongside the hash.

The definition in the article is sufficiently clear. This is all that a salt is:

> a salt is random data fed as an additional input to a one-way function that hashes data

With regard to effective anonymization, the salt is stored by the generator, but not in the exported dataset.

If the "salt" is kept secret then I agree you can't brute force all the phone numbers so easily. But I don't agree that "salt" is the correct term for that technique.
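
For comparison, a sketch of the secret-value variant discussed here, written as a keyed hash with HMAC-SHA-512 and truncated to the first 160 bits as in the earlier comment. Using HMAC rather than plain concatenation is an assumption on my part, and `export_token` is a hypothetical name.

```python
import hashlib
import hmac
import secrets

# Held by the generator only; never written to the exported dataset.
SECRET_KEY = secrets.token_bytes(64)  # 512 bits


def export_token(phone: str, key: bytes) -> str:
    digest = hmac.new(key, phone.encode("ascii"), hashlib.sha512).digest()
    # Keep the first 160 bits (20 bytes) as the exported value.
    return digest[:20].hex()


token = export_token("5550101234", SECRET_KEY)
```

Without the key, an attacker holding only the export cannot test candidate phone numbers at all, which is a different guarantee from a salt stored next to the hash.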
A salt is very good if the input varies. If the input stays within a pre-defined range (e.g. phone numbers), salt does not work very well.
I do not agree that it doesn't work very well. How will you reverse a salt with sufficient entropy? Imagine the salt is a 512-bit hex value, the data is a nine-digit decimal SSN, and the generated hash is 512 bits, of which the first 160 bits are used as the value. Now exactly how is the salt not good enough?