|
|
|
|
|
by quantadev
397 days ago
|
|
Yes, some hash functions are intended to have collisions (like hash algorithms that are designed to put things into 'buckets' for searching for example). And you're correct to notice that by mentioning improbability of collision I'm talking about strong hashes in that sentence. But you can take my words literally nonetheless. When I say "hash" I mean all kinds of hashes. Strong and weak. The existence of weaker hash algos actually moves you further away from your assertion (that semantic vectors are hashes) than closer to it. Weak hashes is about a small finite number of buckets in one dimension. Semantic vectors are an infinite continuum of higher dimensional space locations. These two concepts are therefore the exact opposite. |
|
Hashes are often assumed to be 1d, discrete valued, deterministic, uniformly distributed, and hard-to-reverse. And embeddings are often assumed to have semantic structure. Those two things certainly have some pretty different properties.
In the strict definitions, I’d say if hashing is just mapping to a fixed-size output space and an embedding is a projection/mapping of one space onto another (usually smaller) space, then they’re similar.
Some hash algorithms like SimHash or LSH use random projection onto sets of random hyperplanes to produce output vectors. Blurring the lines fairly well. You could even implement that as a NN with a single projection layer. Or indeed the torch.nn.Embedding class. Of course the outputs are usually then quantized or even binarized, but that’s more a use-case specific performance optimization not fundamental (and sometimes so are embeddings).