Hacker News new | ask | show | jobs
by simonw 2308 days ago
This is a really informative write-up and an excellent learning exercise.

It's worth noting that haveibeenpwned's API has a really clever design for allowing people to look up their passwords without transmitting them to the site.

It's explained here: https://www.troyhunt.com/ive-just-launched-pwned-passwords-v...

The short version is that you can take the first 5 characters of a SHA-1 hash and hit this endpoint:

https://api.pwnedpasswords.com/range/21BD1

The endpoint returns (right now) a list of 528 full hashes along with counts. You can compare your full calculated SHA-1 hash to that list to see if the password is present in the dump.

The trick here is called k-Anonymity - I think it's a really elegant solution. This technique is written up in more detail here: https://blog.cloudflare.com/validating-leaked-passwords-with...

8 comments

Honestly they’re being super mathematical about it but it’s really nothing fancy at all. SHA is designed to be used like this, CloudFlare hasn’t done anything remotely ingenious (and I wouldn’t mind except they go out of their way to talk about how much better their fancy new algorithm is compared to other multi-set intersection theories).

E.g. 512-bit SHA-2 and SHA-3 may be truncated at 128 or 256 bits if that’s all the entropy you need (and you don’t need to be compatible with the formal SHA2/SHA3-512/256 spec). Here, CloudFlare is truncating to an intentionally low entropy of just 20 bits, not to reduce the security but rather to intentionally increase the collisions. It’s ultimately just a glorified hash table and as any CS student can tell you, the bucket size is just a function of the hash size (v1 of the api: 128-bit hash, v2 of the api: 20-bit hash).

(I don’t like pretension.)

It still is pretty clever and non trivial to come up with.

I bet most people who haven't thought about it beforehand or heard the solution, would fail an interview question as "So, we have a database of 100s Million passwords and we want the user to check if his password is inside; without the user actually sending his password.. or his hash; without us outright revealing all our passwords/hashes to the user."

Their solution is a pretty good tradeoff imho.

This is pretty much the case for probabilistic membership functions. It's kind of a broken bloom filter where you take only the front few bits without hashing. I wouldn't expect someone to come up with that under the interview stress, but if you know a few basic algorithms, this should be pretty simple/familiar.

It's also similar to how we spread mail folders in the past to avoid large directories. Or how Debian splits repo by package prefix: http://ftp.us.debian.org/debian/pool/main/

Exactly and good examples. I believe the general concept is just sharding, for anyone that wants to google it further. Typically naive sharding on the first few bytes is useless because it results in uneven distributions but (and this is what I mean when I said SHA was designed for this) a cryptographic hash should result in a uniform random-appearing distribution across all its bits, meaning the resulting hash is uniquely suitable for basic sharding by prefix.

But then they wouldn’t have been able to give it a fancy name if they called it what it is.

For someone who claims to not like pretension (sic), you're being pretty pretentious.
After seeing this I reread the post you're replying to. I don't see the pretention.

Pretentious: adjective. characterized by assumption of dignity or importance, especially when exaggerated or undeserved: a pretentious, self-important waiter. making an exaggerated outward show; ostentatious. full of pretense or pretension; having no factual basis; false.

If you're suggesting that their analysis is false, you should probably point out why. The self importance part I'm not seeing. Aren't they just stating the facts as they see them?

The "really clever design" as noted by simonw was called "nothing fancy at all" by ComputerGuru, and that haveibeenpwned's API design "hasn’t done anything remotely ingenious", this is what made the comment pretentious. (in this case it doesn't matter wether he is or not factually correct)

The reply could very well be interpreted as "You think that Kk-Anonymity is fancy? How primitive of you. This way is the natural way of doing that thing and I would know because I'm the smartest person in the room. Oh by the way I dislike people who are pretentious."

I’m sorry if that’s how you understood it. This is not about the HIBP api but rather the braggadocio in Cloudflare’s post. I meant “they are purposely obscuring how obvious and intuitive their solution is behind all this math so they can seem extra smart, and it’s a shame because it would have been a better and more actually educative post if it just said ‘we did something very obvious and maybe you can just as easily do the same to your own data without being a math geek and it’s no big deal at all that we did, but here’s how we did it anyway.’”
You're mis-representing, because nowhere did ComputerGuru say what you're claiming you quoted, of HIBP.

The not-remotely ingenious part was regarding Cloudflare.

Consider the possibility that those of us who found it pretentious aren't lying, and that you not seeing it is perhaps more related to how you personally see things.
Phrases like "honestly", "nothing fancy", "glorified", "any cs student" all convey (at least to this native English speaker) that whatever is being discussed is being looked down upon.
Right, so the post could be called "dismissive". It's hard to see it as openly glorifying anything though, so the charge of pretentiousness only works of one adopts some extra context.

If you assume that everyone who is dismissive is actually wrong and just pretending to know stuff, then I guess being dismissive is pretentious.

Sometimes it's the only way to look at things...
The analysis might not be false - but it sure is dismissive. Dismissiveness that jars with the stated dislike of 'pretension' and that makes the poster sound pretentious themselves.
You seem to be saying that one cannot be sure of ones own position and not be pretentious, is that what you mean?
No. What I’m trying to say is that any comment on a public forum does not exist in a vacuum and acquires additional properties based on where and when it is posted.
More pompous than pretentious.
That's actually pretty similar to how Google's safe browsing (used by Chrome, Firefox, Safari but not Edge) works. Instead of sending Google your full URL (although that's an option), you can make some transformations, SHA-256 the result and send the first few bytes. The server then replies with the full hashes matching these prefixes. Then you can check whether your URL is on the list. Very similar.
The core trick (and difference) of Safe Browsing is that you don't send stuff most of the time. Safe Browsing clients all download the same summary information which tells them which prefixes might have unsafe hashes. Most sites you visit will not match any unsafe prefix and so your browser doesn't call Google at all.

Pwned Passwords chooses prefixes short enough that any password you wonder about will cause a prefix to be looked up that has lots of Pwned Passwords in it. Was one of them yours? Only you know, this is k-Anonymity.

Safe Browsing chooses prefixes long enough that many sites you look at won't match anything at all. There is still arguably k-Anonymity because the total number of possible URLs is so vast, but that's not their main goal.

I was thinking, with all the recent discussion about the privacy problems with DNS (over https) would it be possible to use the k-Anonymity algorithm to build a dns server where you request dns details based on a sha-1 prefix of the domain you are looking up?

I have a suspicion the answers is no due to the way DNS lookup is recursive and so all dns servers would have to implement the protocol. Would there be a way to make this work?

I tried https://api.pwnedpasswords.com/range/e38ad

And sure enough:

214943DAAD1D64C102FAEC29DE4AFE9DA3D:2413945

Wow, 2.4 million for "password1". Apparently "123456" is the most common password of all time. Even random phrases and memes I tried were pwned, like "The Lord of the Rings" and "correcthorsebatterystaple"

I know you meant the word colloquially, but those aren't random phrases and that's why they're in the dataset.

"The Lord of the Rings" and "correcthorsebatterystaple" match but similar truly random phrases do not, I picked these without looking:

This Duke and Some Circles

impossiblegoosecheesebinder

Sure enough neither of them was in the dataset.

> Sure enough neither of them was in the dataset.

Give it a week...

Worth noting that Tr0ub4dor&3 wasn't pwned, unlike "correct horse battery staple", according to https://haveibeenpwned.com/Passwords :-)
> It's worth noting that haveibeenpwned's API has a really clever design for allowing people to look up their passwords without transmitting them to the site.

If you use the API to look up one of you passwords and it turns out to be pretty poor, then the service knows you're highly likely to have been looking up the top (by count) result, so now has the password and your IP address. I appreciate this service provider is well respected, but still, this is also worth noting.

k-Anonymity is not particularly clever. Every time you use a cryptographic hash to look something up you have a tradeoff around the length of the hash.

A long hash will identify your password very precisely.

A very short hash e.g. down to 1 byte will have so many matches to be useless.

Cloudflare chose:

> For example; the Hash Prefix 21BD1 contains 475 seemingly unrelated passwords, including:

...and this allows attackers to easily create a list of short hashes of common passwords and try them against matching accounts, as they point out:

> It is important to note that where a user's password is already breached, an API call for a specific range of breached passwords can reduce the search candidates used in a brute-force attack

It's clever in the way that if your password does not appear in the dataset the service doesn't have much information (or really any information that could be used to get your password).
Why the hashes I see at this endpoint don't start with the 5 digits I given as an argument?
I assume they've been omitted from the output (each hash is 35 characters, but a full one should be 40). This of course reduces the size of the result, conserving bandwidth.
Thank you for these resources. Truly ingenious.