Hacker News new | ask | show | jobs
by ww520 3172 days ago
"A Bloom filter is a data structure that may report it contains an item that it does not (a false positive), but is guaranteed to report correctly if it contains the item (“no false negatives”)."

I'm afraid that is not how it works. A Bloom filter can tell whether an item may be in the set (false positive) but can definitely tell an item is NOT in the set (no false negative).

4 comments

It's correct; you've misread it — in fact, you're agreeing with it. What it's saying is, if the item is in the set, the Bloom filter is guaranteed to report that it is in the set. What you're saying is, if the filter says the item is not in the set, then it is guaranteed not to be in the set. Those two statements are equivalent (being contrapositives: "A implies B" is equivalent to "not B implies not A").
The article states a Bloom filter is "guaranteed to report correctly if it contains the item", however a Bloom filter cannot do this. The bits set by the various hash functions could very well be set due to some other key. What the Bloom filter _can_ say, is that if none of the bits are set, then clearly the key was never inserted.
> The article states a Bloom filter is "guaranteed to report correctly if it contains the item", however a Bloom filter cannot do this.

I think what's going on here is that you're reading "if" as meaning "whether". In fairness, this is common English usage: "I'll tell you tomorrow if I'm going" usually means "I'll tell you tomorrow whether I'm going".

That's not what the OP means. The correct reading is perhaps clarified with a comma:

> A Bloom filter is guaranteed to report correctly, if it contains the item.

or perhaps better

> If it contains the item, a Bloom filter is guaranteed to report correctly.

Ah yes, that would be it.

IMO it is better to present the Bloom filter by asking/stating what one can be certain of given each return value of the Bloom filter.

I think it might just be a strangely formed sentence where the part "if it contains the item" is not the subject of the report, but a condition:

If it contains the item -> reports that it contains the item (correctly)

If it doesn't contain the item -> may report that it contains the item (incorrectly)

Nothing wrong with being pedantic when dealing with definitions.
Since we're on the topic of being pedantic, "well, you know, that's just, like... your opinion, man". I'd argue that there's nothing excessive about arguing if A may, in fact, be the opposite of A :)
Certainly not, I'm all for some arguing on the internet.
The logic you wrote is exactly the _opposite_ of what a Bloom filter does.
"If the item is in the set, the Bloom filter is guaranteed to report that it is in the set." This statement while correct is not useful since the query is against the Bloom filter, not the set. The Bloom filter reporting the object's hash key is in the filter does NOT mean the object is in the set. Multiple objects can have the same hashkey. A check of the hashkey on the Bloom filter just means ONE of these objects is in the set. The existence of a hashkey in the Bloom filter does not give much useful information.
He is saying "contains implies report". You are saying "not-report implies not-contain".

These are logically equivalent.

They are not. Multiple objects can map to the same hash key due to collision. Just because a hashtable containing the hash key of an object does not mean the object has been seen before. An object's hash key not contained in the hashtable definitely assures that the object has not been seen before.
This is called the contrapositive of your statement. It's logically guaranteed.

No one is claiming that "if the hash matches the object must be there". They're saying "if the object is there, the hash will match". This is logically equivalent to saying "if the hash does not match, the object is not there".

GP is right; you need to be careful about how you read it.

“contains implies report” is different than “report implies contains”. You are (correctly) arguing the against latter statement, which GP does not make.

I believe it is a typo. The rest of the article does align to that concept.

At first glance I do disagree that his/her solution will be as efficient in terms of size of the set and as performant as bloom filters or cuckoo filters. But I would have to benchmark it to be sure.

I can see what's confusing you, but it's correct as stated.

Item is not a reference to the physical data structure in memory. If you treat it as a black box where you put items in and then ask it if it has seen the 'item' then the wording is more clear.