Ask HN: Embeddings as "Semantic Hashes"
15 points
by DavidHaerer
953 days ago
As I understand it, embeddings are semantic representations of input data, such as text or images, in a vector space that maps conceptual meaning to distances. However, this vector space is only meaningful to the model. To draw an analogy, can we compare the model to a hashing algorithm and the embedding to the hash of the input data? If so, what is the equivalent of SHA256? How can we make embeddings future-proof and exchangeable between independent parties?
This goes even further: a model sophisticated enough to capture a probability distribution will produce embeddings that encode this distribution (to some extent). As a result, any two models of that kind produce "equivalent" embeddings that can be transformed into each other. This is an area of active research (in fact, I've just been to a seminar talk about it).
So the answer to "How can we ..." would be: by capturing the distribution, by making the embedding big enough and the training task difficult enough.
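To make "can be transformed into each other" a bit more concrete, here is a minimal sketch. It assumes the simplest possible case, which the comment above does not claim: that the two embedding spaces differ only by a rotation plus a little noise, so an orthogonal map fitted on paired examples (orthogonal Procrustes) is enough. The data is synthetic, and `fit_orthogonal_map`, `emb_a` and `emb_b` are made-up names for illustration. Real models may well need something more flexible than a rotation.

```python
import numpy as np

def fit_orthogonal_map(source, target):
    """Return the orthogonal matrix W minimizing ||source @ W - target||_F."""
    # Orthogonal Procrustes: take the SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

rng = np.random.default_rng(0)

# Toy stand-in for "model A": 200 random 8-dimensional embeddings.
emb_a = rng.normal(size=(200, 8))

# Toy stand-in for "model B": same geometry, but rotated and slightly noisy.
# (Assumption: the two spaces really are related by an orthogonal map.)
true_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
emb_b = emb_a @ true_rotation + 0.01 * rng.normal(size=(200, 8))

# Fit on the first 100 paired items, evaluate on the held-out 100.
w = fit_orthogonal_map(emb_a[:100], emb_b[:100])
residual = (np.linalg.norm(emb_a[100:] @ w - emb_b[100:])
            / np.linalg.norm(emb_b[100:]))
print(f"relative error on held-out pairs: {residual:.4f}")
```

The point of the held-out split is that the map is learned from a limited set of shared inputs and then applied to embeddings it has never seen, which is roughly what "exchangeable between independent parties" would require.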
Examples of embeddings that are re-used are variants of word2vec, CLIP and CLAP.
As others have already mentioned: the hash analogy would be correct if you think about non-cryptographic hashes, but I doubt that this clarifies anything.
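For what the non-cryptographic reading of the analogy could look like, here is a small sketch of one such scheme, random-hyperplane hashing (a form of locality-sensitive hashing). This is my choice of example, not something named in the thread. The three vectors are hypothetical toy "embeddings" rather than output from a real model, and all function names are made up for illustration.

```python
import random

def random_hyperplanes(num_bits, dim, seed=0):
    """Draw one random Gaussian normal vector per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [1 if sum(h * x for h, x in zip(plane, vector)) >= 0 else 0
            for plane in hyperplanes]

def hamming(a, b):
    """Number of positions at which two bit signatures differ."""
    return sum(x != y for x, y in zip(a, b))

planes = random_hyperplanes(num_bits=256, dim=4)

# Hypothetical toy "embeddings"; real ones would come from a model.
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.85, 0.15, 0.05, 0.25]
invoice = [-0.1, 0.9, -0.7, 0.0]

sig_cat, sig_kitten, sig_invoice = (
    signature(v, planes) for v in (cat, kitten, invoice)
)
print("cat vs kitten :", hamming(sig_cat, sig_kitten))
print("cat vs invoice:", hamming(sig_cat, sig_invoice))
```

Vectors that point in similar directions get signatures that differ in few bits, so the "hash" preserves closeness. A cryptographic hash like SHA256 is designed to do the opposite: a tiny change in the input flips about half of the output bits.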