| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by hansvm 1210 days ago

You can just directly calculate what would happen. To respond to novel words (which these demonstrably do) it needs to be equivalent to a character-wise hash table, and to be the same size as LLaMA you can do lookups on around 4 characters (and you have to deal with the data sparsity in constructing many of those tuples). If you want worse output but a better hash table on the output that remains, you could hash words or common words and get contexts of up to a few words rather than a few letters.

LLMs can track mid-range dependencies though. Consider the following input

> Translate the phrase "the lazy brown fox jumped over the thorny brambles" into French, write the translation, and then write the second through fourth words of that translation.

Looking at any one word of the output you need to track many of the input words to get it correct, and the relative positions of those necessary input words is not consistent from one output word to the next. ChatGPT solves the task flawlessly (aside from its habit of explaining what it's doing before doing it). Any hash table solution, at a minimum, would need a complicated heuristic for determining which words/characters to look up.

Doing so brings us back closer to the state of language models before transformers. You had a lot of hand-tuned features, formal grammars, complicated orders of operations, expert lookup tables, and whatnot. Performance was still much, much worse than what we're getting now with deep learning.

None of that is to say that philosophically we're doing anything more than mishmashing probabilities or that something better doesn't exist, but without significant innovation rule-guided fuzzy hash tables aren't it.

1 comments

akomtu 1210 days ago

The fuzzy hash table would use 8192 long token sequences of tokens as keys, and when requested to fetch a key, it would find the nearest keys and return that distribution. The internal representation of this hash table is a cloud of tokens in a 8192×sizeof(token) dimensional space.

The procedure of constructing this table would be just getting all the 1.5 trillion subsequences, each 8192 tokens long, and inserting it: table[seq8192] = token8193 (the next token). Arranging this data efficiently to allow fast lookups is the problem.

link

hansvm 1210 days ago

Ah, so less a hash table and more vanilla KNN?

Edit: I missed this on the first pass, but I'm totally lost as to where 1.5T comes from. Even if you only have two tokens there are vastly more 8192-length subsequences than that (something like 2^8151.5 times more), and if we're just trying to replicate the same space as something like GPT3.5 or LLaMA then you only get on the order of 0.065T to 0.175T entries to play with, much less when you consider that you have a full probability distribution to store (divide by your unique token count, and again by at least 2 if we store at least IEEE f16 probabilities).

link

akomtu 1210 days ago

k-nearest neighbors? Sort of, but I'd rather describe it as a geospatial map in many dimensions.

link

hansvm 1210 days ago

There are lots of interpretations. I actually like KNN for a lot of tasks. My gut says that it still wouldn't perform well here (and for the record, there are efficient data structures for the idea you're describing unless you have some nonstandard modifications, so "arranging the data efficiently to allow fast lookups" is definitely not the core problem), but I admittedly don't have proof of that yet.

For some intuition, imagine the following tasks:

> Repeat the following phrase exactly twice: "sdflhasdflhasdf"

> Repeat the following phrase exactly twice: "sdflhasdflhasdg"

Your fuzzy dictionary or geospatial map can't possibly have enough keys to distinguish the requests (or if it distinguishes those, you can adversarially select different keyboard mashes), and so the result, no matter what it is, would have the same probability distribution for both prompts. Since the desired results are different, at least one of those would be have some unavoidable wrongness.

The GPT family, on the other hand, has few issues with random phrase duplication since positional information is something it explicitly considers and is capable of prioritizing over other token information.

link

akomtu 1209 days ago

Indeed, that's good counterexample.

link