| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by voidmain 3188 days ago

Bloom filters are a nice data structure, and you should absolutely have them in your toolbox, but if you go looking for a reason to use one you are likely to wind up making things worse. The following is not valid reasoning: "Bloom filters are efficient. Therefore if I can find a way to use a bloom filter, my solution will be efficient."

The "SSH keys" protocol in the article seems like an example of this. It doesn't make any sense. Why would the server send the client a Bloom filter if the client has already told it what key it wants to check? The server only has to send one bit back to the client! And if the goal is to not trust the server with the client's (public) key, this protocol doesn't accomplish that either.

And if you do for some reason have to transmit the entire database of compromised SSH keys in a way that permits only membership tests, a Bloom filter isn't the most compact way to do it! For example, off the top of my head, you could calculate an (15+N)-bit hash for each element of the list, sort the hashes, and rice code the deltas. That would take very roughly 32768 * (N+2) bits and give about 1 in 2^N false positives. So for N=13 it is about the size of the bloom filter in the article but gives a false positive rate 8 times lower. This data structure isn't random access like a Bloom filter, but that doesn't matter for something you are sending over the network (which is always O(N)).

3 comments

gopalv 3188 days ago

> The server only has to send one bit back to the client!

As much as the example is a bad one because it leaks server-side info to an unauthenticated client, I've had scenarios where if you have > 3 ssh keys in your key-chain all ssh login attempts fail after 3 keys are tried & cause failures. I end up writing ~/.ssh/config entries; a lot of them for the client to remember which key to try first.

My favourite real-life bloom-filter story is the "unsafe URLs" list that is in Chrome - the "Safe Browsing Bloom" is a neat way to send out obscured information about the bad URLs without actually handing out a list to a user. The web URLs or domains which find a match in this, do need to be checked upstream, but it avoids having to check for every single request with a central service.

On a similar note, been playing with a variant of bloom filters at work called a Bloom-1 filter [1] which works much faster than a regular bloom filter which has a lot of random memory access for 1 bit reads.

[1] - https://github.com/prasanthj/bloomfilter/blob/master/core/sr...

link

papercrane 3188 days ago

> The "SSH keys" protocol in the article seems like an example of this. It doesn't make any sense. Why would the server send the client a Bloom filter if the client has already told it what key it wants to check? The server only has to send one bit back to the client! And if the goal is to not trust the server with the client's (public) key, this protocol doesn't accomplish that either.

There is a footnote on the sequence diagram that the key is not sent to the server on the initial request. Rather the client just does a simple GET. Since it's just sending a static file the client could cache the bloom filter.

link

chrisweekly 3188 days ago

+1 insightful and educational -- in more ways than one!

> "sort the hashes, and rice code the deltas"

Being quite sure that wasn't a racial slur, I looked it up:

> "Rice coding (invented by Robert F. Rice) denotes using a subset of the family of Golomb codes to produce a simpler (but possibly suboptimal) prefix code. Rice used this set of codes in an adaptive coding scheme; "Rice coding" can refer either to that adaptive scheme or to using that subset of Golomb codes." (https://en.wikipedia.org/wiki/Golomb_coding)

link