| Thanks for the link to the 3b1b video. I enjoyed the entire series, and some of the videos he linked to as well. The linked ones explained the history of how they got there - which was new to me and really helps cement some ideas. However, I didn't learn much, which means that as far as I can tell my mental model of how it works wasn't far off. So yes - I was already aware that one interpretation of how these things work is that LLMs turn concepts into vectors in a high-dimensional space, and high-level abstractions are linear combinations of these vectors. Given that model, parts of your comment don't make much sense to me. For example "it's able to predict N different full paths WITHOUT actually exploring them fully" - why do you think that's so? And "GRAM allows it to look at the different hyper words and say" - no, GRAM does not look at different hyper words, or at least no more than a non-GRAM LLM does. Only the last word (the 4k vector or whatever dimension they are using) is fed back through the machine. Regarding "Once decoded into a rigid real word, the multidimensional nuance of the continuous vector is lost!": yes and no. Yes, the decoded word doesn't mean much compared to the vector it was derived from. But the machine isn't operating on just the last word. It's operating on the information in the entire context window (which could be hundreds of thousands of tokens). There is undeniably a lot of nuance encoded in one vector, and yes you're right - it can't be represented with just one word. But it can be represented using a string of words, and generating that representation by spitting out more words is partially what an LLM is doing when it generates text. It's only partially doing that because it's randomly mutating that last token as it goes, and pulling information into the vector from the MLP layers. Re "LLMs ... add so much more meaning to a word than a human ever could imagine try to put 10,000 dimensions on the word "the" ....
OBVIOUSLY makes them enormously less intelligent!". No, it doesn't obviously do that. A vector is maybe 16k bytes (depending on the number of dimensions). That corresponds to around 5000 words. Humans have no trouble connecting those 5000 words into a single concept - which would presumably spell out the concept represented by the vector. Same meaning - just encoded differently. Using computer science terminology - we could say the 16k vector is serialised into a sequence of words. So - two representations of the same thing. What humans do that LLMs can't do right now is squeeze those 5000 words into something tiny. For example, the word "LLM" is a huge concept, squeezed into 3 letters. Human knowledge and thought seem to be based on that one trick - naming abstract concepts, and then using them as building blocks for more abstract concepts. LLMs meanwhile are stuck with their fixed-size vectors. They cannot add new concepts to their vocabulary by modifying their weights. Where LLMs seem to win is their short-term memory (of the order of 200K tokens) and their speed: they are about 1 million times faster (cycle time of the order of 1 nanosecond vs 1 millisecond). That gives their reasoning very different properties from human reasoning. Sometimes this means they are (dramatically) better, and sometimes they are worse. I don't see how GRAM on its own is going to make LLMs 3 orders of magnitude faster than they are now. That 200k token context window is hideously expensive, and the cost of maintaining it grows as O(N^2). As you observed, they can already compress a 100,000 word book into the single token encoded in the last word (although beyond 100k words that compression starts to look increasingly lossy). To get the 3 orders of magnitude speed-up, they are going to have to start taking advantage of that compression, and start throwing away the part of that 200k context they have already encoded. So far, no one has deployed something that does it well. |
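
To put rough numbers on the "16k bytes" vector and the O(N^2) context cost, here is a back-of-envelope sketch. The 4096 dimensions and 4 bytes per value are assumptions picked to match the "4k vector" and "16k bytes" ballparks in the comment, not figures for any particular model, and `vector_bytes` and `attention_pairs` are hypothetical helper names. The pair count also ignores causal masking and any caching tricks, so treat it as a scaling illustration only.

```python
# Back-of-envelope numbers; every constant here is an assumption for illustration.

def vector_bytes(dims: int, bytes_per_value: int) -> int:
    """Size in bytes of one hidden-state vector."""
    return dims * bytes_per_value


def attention_pairs(context_tokens: int) -> int:
    """Token pairs compared by one full self-attention pass: N^2."""
    return context_tokens ** 2


if __name__ == "__main__":
    # 4096 dims * 4 bytes (fp32) = 16384 bytes, i.e. the "16k bytes" ballpark.
    print(vector_bytes(4096, 4))

    # Growing the context 10x (20k -> 200k tokens) costs 100x the pairwise work.
    print(attention_pairs(200_000) // attention_pairs(20_000))
```

The second figure is why the 200k window gets expensive so quickly: a 10x longer context means 100x as many token pairs, so dropping already-encoded context pays off quadratically.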