Hacker News new | ask | show | jobs
by ComputerGuru 2307 days ago
Honestly they’re being super mathematical about it but it’s really nothing fancy at all. SHA is designed to be used like this, CloudFlare hasn’t done anything remotely ingenious (and I wouldn’t mind except they go out of their way to talk about how much better their fancy new algorithm is compared to other multi-set intersection theories).

E.g. 512-bit SHA-2 and SHA-3 may be truncated at 128 or 256 bits if that’s all the entropy you need (and you don’t need to be compatible with the formal SHA2/SHA3-512/256 spec). Here, CloudFlare is truncating to an intentionally low entropy of just 20 bits, not to reduce the security but rather to intentionally increase the collisions. It’s ultimately just a glorified hash table and as any CS student can tell you, the bucket size is just a function of the hash size (v1 of the api: 128-bit hash, v2 of the api: 20-bit hash).

(I don’t like pretension.)

2 comments

It still is pretty clever and non trivial to come up with.

I bet most people who haven't thought about it beforehand or heard the solution, would fail an interview question as "So, we have a database of 100s Million passwords and we want the user to check if his password is inside; without the user actually sending his password.. or his hash; without us outright revealing all our passwords/hashes to the user."

Their solution is a pretty good tradeoff imho.

This is pretty much the case for probabilistic membership functions. It's kind of a broken bloom filter where you take only the front few bits without hashing. I wouldn't expect someone to come up with that under the interview stress, but if you know a few basic algorithms, this should be pretty simple/familiar.

It's also similar to how we spread mail folders in the past to avoid large directories. Or how Debian splits repo by package prefix: http://ftp.us.debian.org/debian/pool/main/

Exactly and good examples. I believe the general concept is just sharding, for anyone that wants to google it further. Typically naive sharding on the first few bytes is useless because it results in uneven distributions but (and this is what I mean when I said SHA was designed for this) a cryptographic hash should result in a uniform random-appearing distribution across all its bits, meaning the resulting hash is uniquely suitable for basic sharding by prefix.

But then they wouldn’t have been able to give it a fancy name if they called it what it is.

For someone who claims to not like pretension (sic), you're being pretty pretentious.
After seeing this I reread the post you're replying to. I don't see the pretention.

Pretentious: adjective. characterized by assumption of dignity or importance, especially when exaggerated or undeserved: a pretentious, self-important waiter. making an exaggerated outward show; ostentatious. full of pretense or pretension; having no factual basis; false.

If you're suggesting that their analysis is false, you should probably point out why. The self importance part I'm not seeing. Aren't they just stating the facts as they see them?

The "really clever design" as noted by simonw was called "nothing fancy at all" by ComputerGuru, and that haveibeenpwned's API design "hasn’t done anything remotely ingenious", this is what made the comment pretentious. (in this case it doesn't matter wether he is or not factually correct)

The reply could very well be interpreted as "You think that Kk-Anonymity is fancy? How primitive of you. This way is the natural way of doing that thing and I would know because I'm the smartest person in the room. Oh by the way I dislike people who are pretentious."

I’m sorry if that’s how you understood it. This is not about the HIBP api but rather the braggadocio in Cloudflare’s post. I meant “they are purposely obscuring how obvious and intuitive their solution is behind all this math so they can seem extra smart, and it’s a shame because it would have been a better and more actually educative post if it just said ‘we did something very obvious and maybe you can just as easily do the same to your own data without being a math geek and it’s no big deal at all that we did, but here’s how we did it anyway.’”
You're mis-representing, because nowhere did ComputerGuru say what you're claiming you quoted, of HIBP.

The not-remotely ingenious part was regarding Cloudflare.

Consider the possibility that those of us who found it pretentious aren't lying, and that you not seeing it is perhaps more related to how you personally see things.
Now you're mis-representing my comment, I didn't anywhere discuss whether or not it was pretentious, I just pointed out that the comment was factually incorrect.
Phrases like "honestly", "nothing fancy", "glorified", "any cs student" all convey (at least to this native English speaker) that whatever is being discussed is being looked down upon.
Right, so the post could be called "dismissive". It's hard to see it as openly glorifying anything though, so the charge of pretentiousness only works of one adopts some extra context.

If you assume that everyone who is dismissive is actually wrong and just pretending to know stuff, then I guess being dismissive is pretentious.

Sometimes it's the only way to look at things...
The analysis might not be false - but it sure is dismissive. Dismissiveness that jars with the stated dislike of 'pretension' and that makes the poster sound pretentious themselves.
You seem to be saying that one cannot be sure of ones own position and not be pretentious, is that what you mean?
No. What I’m trying to say is that any comment on a public forum does not exist in a vacuum and acquires additional properties based on where and when it is posted.
Uh... then you should probably edit your post, because I find it impossible to read this more general statement in the first post.
More pompous than pretentious.