Hacker News new | ask | show | jobs
by marcan_42 2307 days ago
This is bad benchmarking. There is no way you're doing a b-tree lookup in microseconds on an on-disk file... Unless the parts you care about are already cached.

So either the whole file fits in RAM and you pre-load it (in which case you have to account for that memory usage), or you have to run benchmarks on random hashes, which would yield much slower numbers (on the order of 30ms for an HDD).

Personally, when I implemented this in a web service, I used a bloom filter. It has some false positives (tunable) and requires a few extra disk reads per check, but the resulting file is also smaller and the code to generate it and check it is very, very simple.

https://gist.github.com/marcan/23e1ec416bf884dcd7f0e635ce5f2...

P.S. if you need to sort a huge file, just literally use the UNIX/Linux `sort` command. No, it does not load it all into RAM. It knows how to do chunked sorts, dump temp files into /tmp, and then merge them. Old school UNIX tools are smarter than you think.

6 comments

> P.S. if you need to sort a huge file, just literally use the UNIX/Linux `sort` command. No, it does not load it all into RAM. It knows how to do chunked sorts, dump temp files into /tmp, and then merge them. Old school UNIX tools are smarter than you think.

This so much.

I’ve worked with many devs and admins that don’t understand the tools that they have at their disposal on their systems. They end up trying to reinvent the wheel and their solutions usually don’t consider all the edge cases

Note that this does take a while; I let it go for about two hours before killing sort.

     time gsort --parallel=2 -o $HOME/pwned-pass-sorted.txt pwned-passwords-sha1-ordered-by-count-v5.txt
         2890.06 real      1400.86 user       165.54 sys

48mins

`gsort` is GNU sort on coreutils (not the one included on macOS).

This is on a Mac Mini 2011 (5,1) with the 2.3GHz i5.

They really have a lot of things built into these tools :)

> Personally, when I implemented this in a web service, I used a bloom filter.

When I was working on blocking leaked passwords, I found projects using bloomfilters for this (e.g. Keycloak). I find false positives completely unacceptable from a user experience perspective.

Blocking a perfectly fine password is playing with the users mind. If he cares even a little, he will wonder what is wrong with his password (was it leaked???) and if the answer "false positive" isn't available to him it's just evil.

A bloom filter is a good tool to filter out the negatives (which should hopefully be the majority of passwords), but please run all positives against the full set.

It depends what purpose you're using this for. If it's something like a password manager where you're checking the user's existing passwords to see if any have been breached, then yes, you should make sure not to hit false positives.

But if you're using it as a way to prevent people from using known-breached passwords on a site/service, it's really not worth worrying about. False positives would only be a problem when the user is using bad password practices anyway. If it's a non-shared, random password like it should be, a tiny chance of blocking an acceptable one is fine. They can just generate another one, it's an extremely minor inconvenience at worst.

Just include a note in the error message saying something like, "In very rare cases, this could be a false positive. Even if it is, you must choose a different password." The 0.1% of users it impacts (or whatever your error rate is) will be fine.

While I absolutely agree you shouldn't reuse passwords across services, it's a reality for many users and I'm convinced it's not an acceptable point of view to tell your user his password might be leaked, if there is no indication for it. This is not a minor inconvenience, this might trigger a major panic and it's inconsiderate to ignore it.

Even if you display there is a minor chance of a false positive, your user must now think 0,1% this was a false positive or 99,9% my password was leaked! I don't want users to panic if there is no reason to and I don't want users to blame false positives if there is reason to panic. I think it's definitely worth worrying about.

Also research has shown that forcing users to regularly change passwords leads to weaker passwords. And as many (most?) users are not using password managers (yet) expecting a different secure (i.e. long enough, non dictionary) password is too far from reality.

You probably shouldn't indicate to them that their password has been leaked.

The database only lists the SHA hash of passwords which have been found in various datasets and the occurrence count. It does not indicate that the matching password (one of half a billion) has ever been associated with the user account.

The fault in knowledge schemes is oversharing the secret. However, there is no way to know for sure that the user has done this by reusing passwords - you will have false positives (say, one other person on the internet chose this password by chance) and false negatives (the user has reused the password all over the place, but has managed to not be part of a known breach dataset).

The false positive rate I tuned for was lower than the total number of users, so chances are nobody ran into that problem.

If this were a bigger service, yes, I would've done an exact check afterwards :-)

Exactly what I thought. Microseconds? On HDD? That would mean the block is in RAM.

HDD latency is composed of seek + rotation latency + processing time, on the best of the best server HDDs random read might be in the ballpark of 10ms (10,000μs), desktop HDD... maybe around 30-40ms (30,000μs). Or ~600x times more than 49μs.

Desktop HDDs aren't that bad, but 100 IOPS is a good ballpark for back of the envelope estimates (10ms). The OP wrote their B-Tree to require 3 lookups, so assuming they really are single disk reads that's where I calculated 30ms.
Just to confirm, the benchmarks are bad. Almost always the read parts of B-tree file are cached. I'll rewrite the benchmarks and update the results!
Thanks for pointing this out.

You're right. As I said in some comment up there, the benchmarks are bad. I benchmarked the best and the worst case in scope of the original file, so I look up for the first and the last hash.

I totally missed that if I look for the same hash over and over again, I'd end up reading the same B-tree files parts, so they can be easily cached. That's probably the reason it seemed so fast.

I'll rewrite benchmarks and update the results.

About the Bloom filter, I thought about that but I wanted a solution that is 100% correct. The filter doesn't guarantee that.

About the sorting. I wanted to implement a stand-alone library that does everything for you and that is cross-platform. That's why everything is implemented, without dependencies/usage of other tools.

> Personally, when I implemented this in a web service, I used a bloom filter.

I'm not a professional programmer, but I would have supposed a bloom filter is in this case equivalent to just saving the first n bits of every sha1 hash. Is that wrong?

You're right. A bloom filter is a random projection of the input domain. A random projection of a uniformly-populated random input is not useful. A large set of password hashes should be already maximally entropic.
A bloom filter is different here. Imagine we're working with a fixed 512MB of RAM, which we're going to use as a bitset addressing a range of [0 ... 2^32). An over-estimate of the 22.8GB hash file at 40 characters per hash gives that it it contains about 2^29 hashes, so if we store the 32-bit prefix of each hash in this table, we expect it to be about (2^29 / 2^32) = 1/8th full.

We can use this set of prefixes to filter the data. Given an input hash, check whether its prefix is in the table. If it's not in the table, the prefix is certainly not in the file. If the prefix is in the table, there is a 1/8 ~ 13% chance of a "false positive" where the query is not actually in the file, but the set thinks it is.

What a Bloom filter does is start using the rest of the hash, by storing the presence of the next 32-bits of the hash in the same set, and so on. By storing the next 32 bits as well, we can drop the false positive rate to 6%. By using the next chunk of 32 bits, it drops again to 4%, and finally when storing 5 32-bit chunks (very conveniently, we have at this point "used up" all 32-bit chunks of our original 160 bits) to around 2.5%. All the while, the bloom filter is fitting into the same 512MB we originally gave it.

Thanks. I assumed it was more densely populated than that, but now that I think about it it's obviously a human-scale set of passwords.
No, a bloom filter requires k random projections of the input domain over a domain of m elements. In my smaller sample configuration, k=11 and m~=5000000000. log2(5000000000^11) ~= 354 bits (and you need a few more for padding to get better uniformity). SHA-1 hashes are 160 bits, so you can't use the original hash directly to index into the bloom bitmap. You need to hash again at least twice (or once with something like SHA-512).

But really, hashing is so cheap there's no point in trying to be clever like this for an on-disk implementation. My code is designed to take any strings as input, whether they are HIBP hex hashes already or something else. If this were intended to be a high performance in-memory implementation used for a database or something, the constraints would be different.

Thanks! That's what I was thinking.
It's not just saving a few bits. A bloom filter requires a lot less storage than just chopping the hashes for the same false positive performance, because it uses multiple hashing steps (in my default config, 11, but this is tunable depending on false positive rate and input size), and all of them have to hit bits that are set in the filter for the match to be positive.

Pwned passwords 2.0 was half a billion passwords. That's 29 bits. To get a 0.05% false positive rate you need an additional 11 bits or so. That's 40 bits per hash, or 20 billion bits total, and then you need to binary search it (29 lookups) or make it into a tree and add bookkeeping structures.

The equivalent performance bloom filter only takes 5 billion bits of storage and 11 single bit checks in a bitmap.

I use a Bloom filter for this as well, but in Redis using the RedisBloom module. This was really easy to set up, but it does mean that it requires about 1GB of RAM or so (depending on error rate) dedicated to it, so it wouldn't be ideal if you're constrained for RAM.

I run a separate Redis server specifically for this purpose, which means when I want to update the list, I can build the Bloom filter on my local machine, and then just transfer the RDB file to the server and replace the previous one.

Here's the Python code for the simple CLI tool to initialize and add the hashes from the HIBP files to the Bloom filter if anyone's curious: https://gitlab.com/tildes/tildes/-/blob/master/tildes/script...

To check if a password's in the list, you just SHA-1 hash it and send a Redis command like:

    BF.EXISTS breached_passwords_bloom <sha1>