Proof of calculating Minhash

I'm reading about the MinHash technique for estimating the similarity between two sets. Given sets A and B, let h be a hash function and let hmin(S) be the minimum hash over a set S, i.e. hmin(S) = min(h(s)) for s in S. We have the equation:

p(hmin(A)=hmin(B))=|A∩B| / |A∪B|

This means that the probability that the minimum hash of A equals the minimum hash of B is the Jaccard similarity of A and B.
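To make the claim concrete, here is a minimal sketch of how the equation is used as an estimator: apply many different hash functions and count how often the two minimum hashes agree. The salted BLAKE2b hash, the helper names (`salted_hash`, `hmin`, `estimate_jaccard`) and the example sets are assumptions chosen for illustration; they are not part of the question.

```python
import hashlib

def salted_hash(x, salt):
    """64-bit hash of x; each integer salt acts as a different hash function h."""
    digest = hashlib.blake2b(
        repr(x).encode(), digest_size=8, salt=salt.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")

def hmin(s, salt):
    """Minimum hash of the set s under the hash function selected by salt."""
    return min(salted_hash(x, salt) for x in s)

def estimate_jaccard(a, b, num_hashes=1000):
    """Fraction of hash functions for which hmin(a) == hmin(b)."""
    matches = sum(hmin(a, i) == hmin(b, i) for i in range(num_hashes))
    return matches / num_hashes

A = {1, 3, 5, 6}
B = {2, 3, 4, 6}
exact = len(A & B) / len(A | B)   # |A∩B| / |A∪B|
approx = estimate_jaccard(A, B)   # should land close to `exact`
print(exact, approx)
```

The estimate fluctuates around the exact value; more hash functions narrow the spread.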

I am trying to prove the above equation and have come up with my own proof. Take a∈A and b∈B such that h(a)=hmin(A) and h(b)=hmin(B). If hmin(A)=hmin(B), then h(a)=h(b). Assume that the hash function h maps distinct keys to distinct hash values, so h(a)=h(b) if and only if a=b, which has a probability of |A∩B| / |A∪B|. However, my proof is not complete, since a hash function can return the same value for different keys. So I'm asking for your help in finding a proof that applies regardless of the hash function.

I can't be sure what your exact question is.

But if you are looking for a method to prove:

the probability that the minimum hash of A equals the minimum hash of B is the Jaccard similarity of A and B,

try having a look at section 3.3.3 of Mining of Massive Datasets, by Anand Rajaraman and Jeff Ullman.

Think of the hash function just as a means of providing a random permutation of (A ∪ B). Now, think about that permutation.

Put every element of (A ∪ B) as a row in a table, in the order given by the permutation p you have chosen, with two columns A and B that mark membership in each set, like this:

A = {1, 3, 5, 6}
B = {2, 3, 4, 6}
p = {5, 6, 1, 2, 4, 3}

The table:

   A  B
5  1  0
6  1  1
1  1  0
2  0  1
4  0  1
3  1  1

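There are only two types of rows: X, where both A and B are 1, and Y, where exactly one of A and B is 1.

There are |A ∪ B| rows in total, but only |A ∩ B| rows of type X. Every row is equally likely to come first in a random permutation, and hmin(A) = hmin(B) exactly when the first row is of type X, because its element is then the minimum of both sets. The chance that the first row is of type X is X/(X+Y), so Pr[hmin(A) = hmin(B)] = |A ∩ B| / |A ∪ B|.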

This is exactly what the book Nilesh linked says, but I tried to explain with another example.
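The counting argument can also be checked exhaustively for the small example: enumerate every permutation of A ∪ B and count those whose first row lies in both sets. This is only a sketch; the function names are made up for illustration, and "minimum hash under a permutation" is modelled as the first element of the set to appear in that permutation.

```python
from fractions import Fraction
from itertools import permutations

A = {1, 3, 5, 6}
B = {2, 3, 4, 6}

def first_in(order, s):
    """First element of `order` that belongs to s (the min-hash under that permutation)."""
    return next(x for x in order if x in s)

def exact_match_probability(a, b):
    """Fraction of permutations of a ∪ b under which both sets have the same min-hash."""
    universe = sorted(a | b)
    total = matches = 0
    for order in permutations(universe):
        total += 1
        matches += first_in(order, a) == first_in(order, b)
    return Fraction(matches, total)

print(exact_match_probability(A, B))      # 1/3
print(Fraction(len(A & B), len(A | B)))   # 1/3
```

Of the 720 permutations of the six elements, 240 start with 3 or 6, which gives 1/3.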

Proving calculating Minhash, Min-hash functions are not just (standard) hash functions, but a family of functions H, such that if you randomly pick one function h←H out of the family, it will satisfy the … Two sets are more similar (i.e. have relatively more members in common) when their Jaccard index is closer to 1. The goal of MinHash is to estimate J(A,B) quickly, without explicitly computing the intersection and union.

This can't be proved "regardless of the hash function". Just consider: you could use a very poor hash function that produces extremely frequent collisions (such as simply binary-ANDing all values together). MinHash would no longer approximate Jaccard similarity at all, but would report much higher similarities. Proofs of MinHash that I've seen have assumed that hash collisions will be rare enough to be insignificant.
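A small experiment illustrates the point. The sketch below is an assumption-laden toy: it uses a salted BLAKE2b hash and then deliberately forces collisions by reducing the hash modulo a small number of buckets. The helper names and the bucket counts are illustrative only.

```python
import hashlib

def bucketed_hash(x, salt, buckets):
    """64-bit salted hash reduced modulo `buckets`; few buckets means many collisions."""
    digest = hashlib.blake2b(
        repr(x).encode(), digest_size=8, salt=salt.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % buckets

def estimate(a, b, buckets, num_hashes=200):
    """MinHash estimate of the Jaccard similarity using the bucketed hash."""
    hits = 0
    for salt in range(num_hashes):
        min_a = min(bucketed_hash(x, salt, buckets) for x in a)
        min_b = min(bucketed_hash(x, salt, buckets) for x in b)
        hits += min_a == min_b
    return hits / num_hashes

A = {1, 2, 3, 4}
B = {5, 6, 7, 8}   # disjoint, so the true Jaccard similarity is 0

# 2**64 buckets leaves the 64-bit hash untouched; 1 bucket maps everything to 0.
for buckets in (2**64, 16, 2, 1):
    print(buckets, estimate(A, B, buckets))
```

With a single bucket every minimum is 0, so the estimate is 1 even though the sets share nothing; with the full 64-bit range a match between disjoint sets requires an actual hash collision.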

[PDF] 5 Min Hashing - CS @ Utah, Although this gives us a single numeric score to compare similarity (or distance) it is not easy to compute, and will … 0 otherwise. Lemma 5.3.1. Pr[m(Si) = m(Sj)] = E[ĴS(Si,Sj)] = JS(Si,Sj). Proof. There are … Algorithm 5.3.1 Min Hash on set S.

Assume collisions will never happen, or will be negligible. You just choose a length for your hashes such that the chance of them colliding becomes arbitrarily small. This article describes the bounds for various numbers of items and hash sizes. https://en.wikipedia.org/wiki/Birthday_attack
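As a rough guide to choosing the hash length, the usual birthday approximation can be evaluated directly. This is a sketch under the assumption that hash values are uniform and independent; the function name is illustrative.

```python
import math

def collision_probability(num_items, hash_bits):
    """Birthday approximation: 1 - exp(-n(n-1) / (2 * 2**hash_bits))."""
    pairs = num_items * (num_items - 1) / 2
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-pairs / 2**hash_bits)

for n in (10**6, 10**9, 2**32):
    for bits in (32, 64, 128):
        print(n, bits, collision_probability(n, bits))
```

For example, a million items hashed to 64 bits collide with probability on the order of 10^-8, while about 2^32 items are needed before a 64-bit hash reaches roughly a 39% chance of at least one collision.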

[PDF] locality sensitive hashing using minhash, Goal: compute a “signature” for each set, so that P(minhash(S) = minhash(T)) = SIM(S,T). Proof: X = rows with 1 for both S and T. Y = rows with either S or T … To calculate hmin(S), you pass every member of S through the hash function h and find the member that gives the lowest result. Calculate hmin(S) for the sets A and B. Suppose it turns out that, for our chosen h, hmin(A) = hmin(B) (call the value HM). Then HM = hmin(A ∪ B).

MinHash, In computer science and data mining, MinHash is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder (1997) … The probability that two documents have the same MinHash value on all $r$ rows of a given band is $J(A,B)^r$. The probability that they have at least one different MinHash value on some row of a given band is the complement, $1-J(A,B)^r$. The probability that they differ on at least one row in every one of the $b$ bands is $(1-J(A,B)^r)^b$, so the probability that they agree on all rows of at least one band is $1-(1-J(A,B)^r)^b$.
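The banding probabilities are easier to read as numbers. A minimal sketch, where `s` stands for J(A,B), `r` is the number of rows per band and `b` is the number of bands; the function names and the choice r = 5, b = 20 are illustrative assumptions.

```python
def band_mismatch_probability(s, r):
    """Probability that two documents differ in at least one of the r rows of a band."""
    return 1 - s**r

def candidate_probability(s, r, b):
    """Probability that two documents agree on all r rows of at least one of b bands."""
    return 1 - band_mismatch_probability(s, r) ** b

for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, r=5, b=20), 4))
```

With these parameters a pair with similarity 0.2 becomes a candidate well under 1% of the time, while a pair with similarity 0.8 is almost always caught, which is the S-shaped behaviour the banding technique relies on.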

[PDF] Jaccard Similarity 2 Review: MinHash 3 Parity of MinHash, To compute a MinHash signature of a set A = {a1, a2, … Without proof, we state the following fact about the collision probability of L(·). … Proof: 3 types of rows. X: 1 in both columns (count x). Y: 1 in one column, 0 in the other (count y). Z: 0 in both columns (count z). Jac(Si,Sj) = x/(x+y); since z >> x, y (the matrix is mostly empty), rows of type Z are ignored.

Comments
  • Thanks for trying to contribute to Stack Overflow. Though the link might answer the question, it's better to add the consolidated details here to make the answer clearer.