| HN Mirror


by maest 1460 days ago

> The probability of C also having true in the row would be equal to Jaccard’s similarity!

That's clear, call this P1

> The probability that two documents A and B having the same representative token is, equal again to Jaccard’s similarity

That's less clear (call this P2) and not equivalent to the first statement, afaict. In fact, this probability seems lower than the previous one. Consider the table:

    token     A     B
        a False  True
        b  True  True

This counts as matching under P1, but not under P2.

What am I missing here?

In other words, the set of cases where `reptoken(A) = reptoken(B)` is a subset of the cases where `reptoken(A) is in B`.

2 comments

yellowflash 1453 days ago

The probability of `b` being chosen for A is 1 (no other choice), and for B it is 1/2. The probability of `b` being chosen for both (the only common token) is |{(b, b)}| / |{(b, a), (b, b)}| = 1/2, which is Jaccard's similarity |{b}| / |{a, b}|.

I could have explained that a bit better I suppose.
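
A quick sketch of that count in Python, for the table in the parent comment only. It assumes A = {b} and B = {a, b} and lists the (pick for A, pick for B) pairs the way this comment does; the names `pairs`, `matches` and `p_same` are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# The two documents from the table: A contains only b, B contains a and b.
A = {"b"}
B = {"a", "b"}

# Every (representative of A, representative of B) pair for this table.
pairs = sorted(product(A, B))
# Pairs where both documents end up with the same representative token.
matches = [(x, y) for x, y in pairs if x == y]

p_same = Fraction(len(matches), len(pairs))
jaccard = Fraction(len(A & B), len(A | B))
print(pairs, p_same, jaccard)
```

This only checks the arithmetic for this one table; it is not a general argument.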


kwillets 1460 days ago

The net result of the hashing etc. is to shuffle the unique elements of A union B. In that shuffled union, the first element is in at least one of A or B; if it's in both it's in the intersection. The chance of that is J.
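
The shuffle argument can be checked by brute force on small sets. A minimal sketch, assuming both documents are non-empty and that the representative token is the first token of the document in one shared shuffle of the union (a stand-in for the minimum hash); `rep_token`, `match_probability` and `jaccard` are hypothetical helper names, not from the article.

```python
from fractions import Fraction
from itertools import permutations


def rep_token(doc, order):
    """Return the first token of `doc` in the shuffled order."""
    return next(t for t in order if t in doc)


def match_probability(a, b):
    """Exact chance that a and b share a representative token,
    taken over every ordering of the union (same ordering for both)."""
    total = 0
    hits = 0
    for order in permutations(list(a | b)):
        total += 1
        if rep_token(a, order) == rep_token(b, order):
            hits += 1
    return Fraction(hits, total)


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| as an exact fraction."""
    return Fraction(len(a & b), len(a | b))


# The table from the top comment: A = {b}, B = {a, b}.
A = {"b"}
B = {"a", "b"}
print(match_probability(A, B), jaccard(A, B))
```

Because both documents use the same shuffle, the representatives agree exactly when the first token of the shuffled union is in both documents, which is why the two printed values coincide. Enumerating all orderings is only practical for a handful of tokens.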


maest 1459 days ago

Maybe I'm missing something, but I don't think that addresses my concerns around P1 and P2 being different.
