Hacker News
by nighthawk454 396 days ago
No, mapping to a _fixed length code_, that’s it. Note that the output domain need not be smaller than the input domain either.

If your model takes a sequence of 1 or 10,000 or N tokens and returns a vector of fixed length, say 1500 dimensions, then it is a hash function of sorts.

> A hash function is any function that can be used to map data of arbitrary size to fixed-size values

https://en.m.wikipedia.org/wiki/Hash_function
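
To make the "arbitrary size in, fixed size out" point concrete, here is a minimal sketch. It is not a learned model: `toy_embed` and `DIM` are hypothetical names, and the body uses the feature-hashing trick as a stand-in for whatever a real embedding model does internally. The only property it is meant to show is that 1 token, 10,000 tokens, or 0 tokens all come out as a vector of the same length.

```python
import hashlib

DIM = 8  # hypothetical fixed output size; the comment above uses 1500 as its example


def toy_embed(tokens, dim=DIM):
    """Map a token sequence of any length to a list of exactly `dim` floats.

    Stand-in for an embedding model: each token is bucketed by a hash of its
    text (feature hashing), so this is an assumption-laden toy, not a real model.
    """
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        idx = digest[0] % dim                       # which dimension the token lands in
        sign = 1.0 if digest[1] % 2 == 0 else -1.0  # signed update to reduce collision bias
        vec[idx] += sign
    return vec


short = toy_embed(["hello"])
long_ = toy_embed(["hello", "world"] * 5000)  # 10,000 tokens
```

Both `short` and `long_` have length `DIM`, which is all the quoted Wikipedia definition asks for.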


I mean you can even look at this from an entropy perspective. A good hash algo generates pure noise (high entropy), while a semantic vector generates structure (low entropy). These two concepts are as far apart as anything could be, no matter what metric of comparison you choose to use. You literally couldn't name two concepts that are further apart if you tried.
Frankly, you’re ignoring the definition at this point. A “good hash algo” only generates noise in the cryptographic hash sense. There are in fact other hashes. The fact that “semantic vectors” preserve a useful similarity is no different mathematically than LSH (locality-sensitive hashing) or many others (except that the models work a lot more usefully).
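
For anyone who hasn't seen a similarity-preserving hash, here is a rough sketch of one LSH family, random-hyperplane hashing (often called SimHash), next to SHA-256. The function names, the 16-bit signature size, and the example vectors are all assumptions picked for illustration. Each output bit records which side of a random hyperplane the input falls on, so inputs pointing in the same direction share bits, while the cryptographic digest changes completely.

```python
import hashlib
import random


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random Gaussian normal vectors (seeded so runs are repeatable)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on the non-negative side, else 0."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bits.append(1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def crypto_digest(vec):
    """SHA-256 over the vector's text representation, for contrast."""
    return hashlib.sha256(repr(vec).encode("utf-8")).hexdigest()


planes = make_hyperplanes(n_bits=16, dim=4)
v = [1.0, 2.0, 3.0, 4.0]
v_scaled = [2.0 * x for x in v]   # same direction, different magnitude
v_opposite = [-x for x in v]      # exactly opposite direction

same_dir = hamming(lsh_signature(v, planes), lsh_signature(v_scaled, planes))
opp_dir = hamming(lsh_signature(v, planes), lsh_signature(v_opposite, planes))
```

Here `same_dir` is 0 and `opp_dir` is 16: the signature tracks direction, which is the structure being preserved. `crypto_digest(v)` and `crypto_digest(v_scaled)` share nothing useful. Both are still fixed-size outputs of a hash function.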

If you’re trying to say MD5 isn’t an LLM, then fine no argument there. But otherwise consider referencing something other than vibes or magic, because the definition is clear. “Semantic vectors” isn’t some keyword to be invoked that just generates entropy from the void.

Oh, I get your argument. You think all functions that have finite output length for [usually] longer input length, are hashes. I totally get what you're saying, and it necessarily also means every LLM "inference" is actually a hashing algo too, as you noticed yourself, tellingly. So taking a long set of input tokens, and predicting the next token is (according to you), a "hashing" function. Got it. Thanks.