Hacker News
by lolinder 849 days ago
There's a misconception in the question that is important to address first: when an LLM is running inference it isn't querying its training data at all, it's just using a function that we created previously (the "model") to predict the next word in a block of text. That's it. When considering plain inference (no web search or document lookup), the decisions that determine a model's speed and capabilities come before the inference step, during the creation of the model.

Building an LLM consists of defining its "architecture" (an enormous mathematical function that defines the model's shape) and then using a lot of trial and error to guess which "parameters" (constants that we plug in to the function, like 'm' and 'b' in y=mx+b) will be most likely to produce text that resembles the training data.
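To make the y=mx+b comparison concrete, here is a minimal sketch, not how any real LLM is trained at scale. The "architecture" is the fixed shape of a straight line, and the "trial and error" is plain gradient descent on a squared-error loss. The data set, step count and learning rate are made up for illustration.

```python
# Toy illustration: the "architecture" is the fixed shape y = m*x + b,
# and training searches for the parameters m and b.

def predict(m, b, x):
    """The fixed 'architecture': a straight line."""
    return m * x + b

def train(points, steps=5000, lr=0.01):
    """Fit m and b by gradient descent on mean squared error."""
    m, b = 0.0, 0.0  # arbitrary starting guess
    n = len(points)
    for _ in range(steps):
        # How the error changes if we nudge m or b slightly.
        grad_m = sum(2 * (predict(m, b, x) - y) * x for x, y in points) / n
        grad_b = sum(2 * (predict(m, b, x) - y) for x, y in points) / n
        # Move both parameters in the direction that reduces the error.
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b

# Hypothetical training data generated from y = 3x + 1.
data = [(x, 3 * x + 1) for x in range(-5, 6)]
m, b = train(data)
print(round(m, 3), round(b, 3))  # should land very close to 3 and 1
```

Once `train` returns, the data is no longer needed: `predict(m, b, x)` only uses the two numbers that training left behind. That is the sense in which inference doesn't query the training data.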

So, to your question: LLMs tend to perform better the more parameters they have, so larger models will tend to beat smaller models. Larger models also require a lot of processing power and/or time per inferred token, so we do tend to see that better models take more processing power. But this is because larger models tend to be better, not because throwing more compute at an existing model helps it produce better results.
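For a rough feel of "more processing power per inferred token", a common rule of thumb (an approximation that isn't from this thread, and that ignores the extra cost of attention over long contexts) is about 2 floating-point operations per parameter per generated token. The model sizes below are hypothetical.

```python
def forward_flops_per_token(n_params):
    """Rule-of-thumb estimate: one multiply and one add per parameter per token."""
    return 2 * n_params

# Hypothetical model sizes, chosen only to show the scaling.
small = forward_flops_per_token(7_000_000_000)
large = forward_flops_per_token(70_000_000_000)
print(large / small)  # ten times the parameters, roughly ten times the work per token
```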


An analogy that works without having to explain anything at all about how LLMs actually work (or maybe does explain a lot, depending on how you look at it) could be:

* LLMs are lossy compression functions on their training data.

* The size of the model dictates how lossy the compression is.

* You can't spend compute to get more detail out of a model once it's been compressed/trained, any more than you can spend compute to get an incredibly lossily-compressed movie to go from 240p back to the original 1080p source.

You obviously can do that though; diffusion models produce better (fsvo better) images the more steps you run of them.

Similarly, LLMs can produce better answers if you teach them thinking strategies that remind them to put the available evidence and intermediate steps in their context window. Otherwise they'll tend to hallucinate an answer out of vaguely correct words.

Diffusion models are a different architecture, namely, a recursive or iterative one. Transformer models are not recursive or iterative.

Sure they are. It only natively outputs one token; the recursive process is how you get the rest out of them.

You’re totally right … should’ve thought that one through more.
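A minimal sketch of the "one token at a time" loop described above. The `model` here is a hypothetical lookup table standing in for a real transformer, which would look at the whole context and return probabilities rather than a single fixed token.

```python
# Toy stand-in for a language model: maps the last token to the next one.
NEXT_TOKEN = {
    "<start>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "<end>",
}

def model(context):
    """One 'forward pass': returns exactly one next token."""
    return NEXT_TOKEN[context[-1]]

def generate(max_tokens=10):
    context = ["<start>"]
    for _ in range(max_tokens):
        token = model(context)  # one inference step yields one token
        if token == "<end>":
            break
        context.append(token)   # feed the output back in as input
    return context[1:]

print(generate())
```

The loop, not the model itself, is what makes generation iterative: each pass sees everything produced so far.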
> You can't spend compute to get more detail [...]

Upscaling, technically, is a thing without limits, no?

> But this is because larger models tend to be better, not because throwing more compute at an existing model helps it produce better results.

There's a caveat here - allowing the model to produce more tokens (i.e. giving it more compute time to "think") can produce better results. E.g. asking a model to reason before producing an answer leads to better answers. And the extra tokens = more compute.

True! It's important to first understand the fundamentals of what makes an LLM "good" and what makes it fast, but yes, there are lots of techniques you can apply right before and during the inference step that can trade off between speed and capabilities.

Different prompting techniques like what you're describing are one way, and RAG [0] and ART [1] are also in a similar category.

[0] https://stackoverflow.blog/2023/10/18/retrieval-augmented-ge...

[1] https://www.promptingguide.ai/techniques/art

And adding some more here. I don't know if any models are doing this, but there is the possibility of generating tokens that the model does not show to the user. I think there's quite a lot of scope for internal monologue/chain of thought that could provide concise but clever answers. The difficulty in this is the latency while it ponders to itself, but having played with the Groq demos, I think there's scope for a decent interactive experience.

The concern people might feel when they realise an ai might have private thoughts is another issue entirely.

That was indeed part of what I was wondering about.

Larger and smaller, in my beginner mind, was a difference of how much recursiveness the design of the model allowed.

- User request implies knowledge about X. - PULLING in weights for X. - Probability of user knowing about Xm and Xz is low (because the training data says Xm and Xz are PhD-level knowledge or something). - Pulling in weights for an ELI5-level explanation of Xm and Xz ...

I thought an LLM would do this recursive pulling of weights based on the semantics of the user request, which it does, but it doesn't do that "dynamically" based on "recalculated" weights and regenerated combos of tokens. That could only happen if the training data weren't "frozen" and were still accessible, which, as I learned further down in the comments, it isn't.

That's why I wondered whether more processing power and/or time would benefit this recursive generation and pulling.

Yea, doing things like Chain of Thought, and/or running a second query to examine the first set of tokens it generated, commonly improves answers.
This doesn't change the point of your answer, but to add on, the result of that learned function is the probability of all tokens occurring next which is sampled when inference is happening. The type of sampling used can be different at inference time.
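A rough sketch of what "sampling" can look like, assuming the model has already produced one score per candidate token. The token list and scores are made up; greedy picking and temperature sampling are two common choices, not the only ones.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next(tokens, scores):
    """Always pick the single most likely token."""
    return tokens[scores.index(max(scores))]

def sample_next(tokens, scores, temperature=1.0, rng=random):
    """Roll weighted dice over the probability distribution."""
    probs = softmax(scores, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical candidates and scores for the next token.
tokens = ["cat", "dog", "car"]
scores = [2.0, 1.5, 0.1]

rng = random.Random(0)  # seeded so repeated runs give the same draws
print(greedy_next(tokens, scores))
print([sample_next(tokens, scores, temperature=1.0, rng=rng) for _ in range(5)])
```

The learned function stays the same in both cases; only the rule for turning its probabilities into one chosen token changes.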
I'm still figuring out "inference time", but what left me puzzled at first was that there is, to humans at least, an infinite number of tokens that might come next (technical jargon, synonyms, lexical levels in general). So in my mind there was an RNG built into the function that, after "filtering" the weights based on the user request (and a lot of different tokens, even those meaning the same or almost the same, have the same weights), simply rolled the dice to produce the return string.

I thought the LLM was "getting to know the user" but it had a short memory span (the context) and thus "forgot" already calculated weights that it would use to (re)generate new weights.

Further down I learned it freaking forgets all the previous weights in general (I think that's what I learned, I'm getting there)

could a training model be fed the raw data or source and weights of an llm and create better functioning llms by spotting patterns and things between models? like if you could feed it all the open source models and it could create sub models off of those and maybe even a 2nd Gen 'self' instance to better train on the second set such that maybe it could find ways to get the same results with 5b model as 75b.
People take a model and continue training it all the time (that is, starting with the already derived weights of one model and doing more training on it to make it something different). Usually this is done to make the model more fit for a specific task, but it won't often make it generically better, assuming the first effort was using the model to its full potential (not "underfit").

The 75B param model simply has more complexity to work with than the 5B model.

In the same sense that: `y = mx + b` is just not as expressive as `y = ax^2 + bx + c`.
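A small sketch of that expressiveness gap, using made-up curved data. The line is fitted with the closed-form least-squares formula; the quadratic's parameters are simply plugged in by hand (a=1, b=0, c=0) rather than fitted, to keep the example short.

```python
def fit_line(points):
    """Closed-form least-squares fit of y = m*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    m = cov_xy / var_x
    b = mean_y - m * mean_x
    return m, b

def mean_squared_error(f, points):
    return sum((f(x) - y) ** 2 for x, y in points) / len(points)

# Hypothetical curved data: y = x^2.
points = [(x, x * x) for x in range(-3, 4)]

m, b = fit_line(points)
line_error = mean_squared_error(lambda x: m * x + b, points)
quad_error = mean_squared_error(lambda x: 1 * x * x + 0 * x + 0, points)
print(line_error, quad_error)
```

Even the best possible line leaves a large error on this data, while the quadratic can match it exactly. The quadratic can also reproduce any line by setting a=0, so the extra parameter only adds options.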

well, i was thinking more like..... something that could spit out an android app because its source is 5k android apps binary/hex code...i.e. it goes off internals, basically it's a model of models. So it could find some common ground between all models, and create a new model that's the best of all of them. Then add itself to that list of models, and start up the next generation to do it all over again, including itself, and keep repeating until it can't get any better maybe, or until it finds a new way of doing training, or something. I guess I'm looking for a way to speed up the ai singularity when ai can build upon itself, or really learn like a human, as in receive new input and it's added to the whole of the thing in real time.
That's mostly a shortcut to making the model worse rather than better because it'll just continually get more obsessive having learned about its own biases.

It's viable if you have tools or humans in the loop to comment on them and add new insights.

But the speed isn't really a factor here, and seeing 1000 new apps isn't obviously going to make it better if the model is already at the limits of what it can represent with its parameter count and compression so to speak.

I could imagine something like that working in theory, but the number of examples you would need to train such a model makes it completely impractical. We tend to need billions of examples to get a modern deep learning model working well, and it will be a very long time before we reach that many examples of good LLMs.
In a way this is already how the model is trained. Model makes a prediction, loss function calculates how “wrong” the prediction was, and we update the weights of the model to minimize the loss.
This is an excellent, concise description of how LLMs work, thanks.
You are incorrect. Increasing compute during inference renders similar gains to increasing parameters/compute during training time (see self-consistency, tree of thoughts, etc.)
Can you elaborate upon that? Apart from the multiplication and accumulation of activations and weights, what additional computations can be applied to improve the outputs?

I think it has already been implied that we are not talking about increasing the quantity of parameters in this context but the possibility of applying additional compute to a model with a given number of parameters.

You can train a smaller model and run inference multiple times, and it can reach similar performance to a larger model running inference just once. The best way to make use of those multiple inferences is still up for debate, but we already know it works (self-consistency is one example).
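A toy sketch of the self-consistency idea: sample several answers and keep the most common one. `noisy_model` is a hypothetical stand-in for one sampled LLM answer, and its 60% hit rate is invented for illustration, so the numbers say nothing about real models.

```python
import random
from collections import Counter

def noisy_model(question, rng):
    """Hypothetical stand-in for one sampled answer:
    right 60% of the time, otherwise one of two wrong answers."""
    if rng.random() < 0.6:
        return "42"
    return rng.choice(["41", "43"])

def self_consistency(question, n_samples, rng):
    """Sample several answers and return the most common one."""
    answers = [noisy_model(question, rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

def accuracy(n_samples, trials=2000, seed=0):
    """Fraction of trials in which the voted answer is the correct one."""
    rng = random.Random(seed)
    correct = sum(
        self_consistency("q", n_samples, rng) == "42" for _ in range(trials)
    )
    return correct / trials

# One sample per question versus a 15-sample majority vote.
print(accuracy(1), accuracy(15))
```

The vote only helps because the wrong answers are scattered while the right one repeats. If a model is consistently wrong in the same way, extra samples won't fix that.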
I wasn't able to elaborate on what I meant by "better" when I asked the question, but the idea can indeed be summarized as "will an LLM increase the quantity and quality of parameters if you give it more processing power and time". Now I know that language models don't do that at all and that the weights of the user request stored in the "frozen" training data are what assemble the return after generating possible output strings, which are selected by pre-prompts like asking for chain of thought and reasoning paths and so on, which in the end are nothing more than more weights pulling in more specific context. (I'm just thinking out loud here)
Yeah, I totally forgot about training time and time of request (aaah, inference time! now I get it.) being completely different points in time because the LLM has no access to the training data anymore.
Right on. A total misconception on my part. And your answer was a nice primer before diving in to the rest of the comments. Thanks!