Hacker News new | ask | show | jobs
by akomtu 1210 days ago
The fuzzy hash table would use 8192 long token sequences of tokens as keys, and when requested to fetch a key, it would find the nearest keys and return that distribution. The internal representation of this hash table is a cloud of tokens in a 8192×sizeof(token) dimensional space.

The procedure of constructing this table would be just getting all the 1.5 trillion subsequences, each 8192 tokens long, and inserting it: table[seq8192] = token8193 (the next token). Arranging this data efficiently to allow fast lookups is the problem.

1 comments

Ah, so less a hash table and more vanilla KNN?

Edit: I missed this on the first pass, but I'm totally lost as to where 1.5T comes from. Even if you only have two tokens there are vastly more 8192-length subsequences than that (something like 2^8151.5 times more), and if we're just trying to replicate the same space as something like GPT3.5 or LLaMA then you only get on the order of 0.065T to 0.175T entries to play with, much less when you consider that you have a full probability distribution to store (divide by your unique token count, and again by at least 2 if we store at least IEEE f16 probabilities).

k-nearest neighbors? Sort of, but I'd rather describe it as a geospatial map in many dimensions.
There are lots of interpretations. I actually like KNN for a lot of tasks. My gut says that it still wouldn't perform well here (and for the record, there are efficient data structures for the idea you're describing unless you have some nonstandard modifications, so "arranging the data efficiently to allow fast lookups" is definitely not the core problem), but I admittedly don't have proof of that yet.

For some intuition, imagine the following tasks:

> Repeat the following phrase exactly twice: "sdflhasdflhasdf"

> Repeat the following phrase exactly twice: "sdflhasdflhasdg"

Your fuzzy dictionary or geospatial map can't possibly have enough keys to distinguish the requests (or if it distinguishes those, you can adversarially select different keyboard mashes), and so the result, no matter what it is, would have the same probability distribution for both prompts. Since the desired results are different, at least one of those would be have some unavoidable wrongness.

The GPT family, on the other hand, has few issues with random phrase duplication since positional information is something it explicitly considers and is capable of prioritizing over other token information.

Indeed, that's good counterexample.