概述
application
there are many data-mining problems which can be expressed as finding similar sets, such as:
- pages with similar words, e.g., for classification by topic
- recommendation systems, classification by people or movie
- entity resolution, different informs all point to one person

essential parts

shingling

minhashing
jaccard similarity measure

matrix represent

minihash
总体步骤

例子

性质

签名的相似性

implementation
将对矩阵的转置操作用不同的hash函数代替。

LSH
LSH means locality-sensitive hashing.
概述
- general idea: generate from the collection of all elements (signatures in our examples) a small list of candidates pairs : pairs of elements whose similarity must be evaluated.
- for signature matrices: hush columns to many buckets, and make elements of the same bucket candidate pairs.
候选pairs的规则

LSH具体阐述

例子

概率分析

总结

得到更多的signature(但是会有更多的空间占用与计算),可以有更大的b和r,能够获得更step的函数。


