

by lolinder 849 days ago

There's a misconception in the question that is important to address first: when an LLM is running inference it isn't querying its training data at all, it's just using a function that we created previously (the "model") to predict the next word in a block of text. That's it. When considering plain inference (no web search or document lookup), the decisions that determine a model's speed and capabilities come before the inference step, during the creation of the model.

Building an LLM consists of defining its "architecture" (an enormous mathematical function that defines the model's shape) and then using a lot of trial and error to guess which "parameters" (constants that we plug into the function, like 'm' and 'b' in y=mx+b) will be most likely to produce text that resembles the training data.
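To make the y=mx+b comparison concrete, here is a minimal sketch in which the "architecture" is a straight line and training is literal trial and error. Everything in it (the function names, the toy data, the random-nudge search) is an assumption for illustration. Real LLMs are trained with gradient-based optimization on vastly larger functions, not with this kind of random search.

```python
import random

def model(x, m, b):
    """The 'architecture': a fixed function shape with two parameters."""
    return m * x + b

def loss(params, data):
    """Mean squared error between the model's predictions and the data."""
    m, b = params
    return sum((model(x, m, b) - y) ** 2 for x, y in data) / len(data)

def train(data, steps=20000, seed=0):
    """Crude trial and error: nudge the parameters at random, keep improvements."""
    rng = random.Random(seed)
    params = (0.0, 0.0)
    best = loss(params, data)
    for _ in range(steps):
        candidate = (params[0] + rng.gauss(0, 0.1), params[1] + rng.gauss(0, 0.1))
        candidate_loss = loss(candidate, data)
        if candidate_loss < best:
            params, best = candidate, candidate_loss
    return params

# Toy "training data" generated from y = 3x + 1 (an illustrative choice).
training_data = [(x, 3 * x + 1) for x in range(-5, 6)]
m, b = train(training_data)

# Inference only needs the learned parameters; the data can be thrown away.
del training_data
prediction = model(10, m, b)
print(f"m={m:.2f}, b={b:.2f}, prediction for x=10: {prediction:.2f}")
```

Note that the last call works after the training data has been deleted, which is the point made above: inference evaluates a function and never queries the data it was fit to.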

So, to your question: LLMs tend to perform better the more parameters they have, so larger models will tend to beat smaller models. Larger models also require a lot of processing power and/or time per inferred token, so we do tend to see that better models take more processing power. But this is because larger models tend to be better, not because throwing more compute at an existing model helps it produce better results.
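As a rough back-of-the-envelope sketch of why larger models cost more per token: a common rule of thumb for dense transformer models is about 2 floating-point operations per parameter for each generated token. That approximation ignores the cost of attention over long contexts, and the model sizes below are only illustrative assumptions.

```python
def forward_flops_per_token(n_params):
    """Rule of thumb: ~2 floating-point operations per parameter per token."""
    return 2 * n_params

# Illustrative model sizes, not measurements of any particular model.
for name, n_params in [("5B", 5e9), ("75B", 75e9)]:
    print(f"{name}: ~{forward_flops_per_token(n_params):.1e} FLOPs per token")

ratio = forward_flops_per_token(75e9) / forward_flops_per_token(5e9)
print(f"The larger model needs ~{ratio:.0f}x the compute per token")
```

The cost is fixed by the parameter count chosen at training time, so on the same hardware the larger model is slower for every token it produces.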

7 comments

cjbprime 849 days ago

An analogy that works without having to explain anything at all about how LLMs actually work (or maybe does explain a lot, depending on how you look at it) could be:

* LLMs are lossy compression functions on their training data.

* The size of the model dictates how lossy the compression is.

* You can't spend compute to get more detail out of a model once it's been compressed/trained, any more than you can spend compute to get an incredibly lossily-compressed movie to go from 240p back to the original 1080p source.

astrange 848 days ago

You obviously can do that though; diffusion models produce better (fsvo better) images the more steps you run of them.

Similarly, LLMs can produce better answers if you teach them thinking strategies that remind them to put the available evidence and intermediate steps in their context window. Otherwise they'll tend to hallucinate an answer out of vaguely correct words.

profile53 848 days ago

Diffusion models are a different architecture, namely, a recursive or iterative one. Transformer models are not recursive or iterative.

astrange 848 days ago

Sure they are. It only natively outputs one token; the recursive process is how you get the rest out of them.
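A minimal sketch of that loop, assuming a toy lookup table in place of a real transformer (the `NEXT_TOKEN` table and both function names are hypothetical): each call produces exactly one token, and the output is appended to the context and fed back in to get the next one.

```python
# Hypothetical stand-in for a trained model: maps the last token to the
# single most likely next token. A real model looks at the whole context.
NEXT_TOKEN = {
    "<start>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "down",
    "down": "<end>",
}

def predict_next(context):
    """One 'forward pass': read the context, return exactly one token."""
    return NEXT_TOKEN.get(context[-1], "<end>")

def generate(prompt, max_tokens=10):
    """Autoregressive decoding: call the model repeatedly on its own output."""
    context = list(prompt)
    for _ in range(max_tokens):
        token = predict_next(context)
        if token == "<end>":
            break
        context.append(token)  # feed the output back in as input
    return context

output = generate(["<start>", "the"])
print(" ".join(output[1:]))
```

More generated tokens means more passes through the loop, which is where the extra inference compute goes.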

profile53 847 days ago

You’re totally right … should’ve thought that one through more.

frannyg 848 days ago

> You can't spend compute to get more detail [...]

Upscaling, technically, is a thing without limits, no?

rohanmehta1 849 days ago

> But this is because larger models tend to be better, not because throwing more compute at an existing model helps it produce better results.

There's a caveat here: allowing the model to produce more tokens (i.e. giving it more compute time to "think") can produce better results. E.g. asking a model to reason before producing an answer leads to better answers. And the extra tokens = more compute.

lolinder 849 days ago

True! It's important to first understand the fundamentals of what makes an LLM "good" and what makes it fast, but yes, there are lots of techniques you can apply right before and during the inference step that can trade off between speed and capabilities.

Different prompting techniques like what you're describing are one way, and RAG [0] and ART [1] are also in a similar category.

[0] https://stackoverflow.blog/2023/10/18/retrieval-augmented-ge...

[1] https://www.promptingguide.ai/techniques/art

Lerc 849 days ago

And adding some more here. I don't know if any models are doing this, but there is the possibility of generating tokens that are not shown to the user. I think there's quite a lot of scope for internal monologue/chain of thought that could provide concise but clever answers. The difficulty in this is the latency while it ponders to itself, but having played with the Groq demos, I think there's scope for a decent interactive experience.

The concern people might feel when they realise an AI might have private thoughts is another issue entirely.

frannyg 848 days ago

That was indeed part of what I was wondering about.

Larger and smaller, in my beginner mind, was a difference of how much recursiveness the design of the model allowed.

- User request implies knowledge about X.
- PULLING in weights for X.
- Probability of user knowing about Xm and Xz is low (because the training data says Xm and Xz are PhD-level knowledge or something).
- Pulling in weights for an ELI5-level explanation of Xm and Xz ...

I thought an LLM would do this recursive pulling of weights based on the semantics of the user request, which it does, but it doesn't do that "dynamically" based on "recalculated" weights and regenerated combos of tokens. That could happen if the training data weren't "frozen" and were still accessible, which, as I learned further down in the comments, it isn't.

That's why I wondered whether more processing power and/or time would benefit this recursive generation and pulling.

pixl97 849 days ago

Yea, doing things like Chain of Thought, and/or running a second query to examine the first set of tokens it generated, commonly improves answers.

tracerbulletx 849 days ago

This doesn't change the point of your answer, but to add on: the result of that learned function is a probability distribution over all the tokens that could occur next, which is sampled from when inference is happening. The type of sampling used can be different at inference time.
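A minimal sketch of what that sampling step can look like, assuming three made-up tokens and made-up scores (real vocabularies have tens of thousands of tokens). It shows two common choices: always taking the most likely token (greedy), or drawing at random in proportion to the probabilities, optionally reshaped by a temperature.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(tokens, probs):
    """Always pick the single most likely token."""
    return tokens[probs.index(max(probs))]

def sample(tokens, probs, rng):
    """Pick a token at random, weighted by its probability."""
    return rng.choices(tokens, weights=probs, k=1)[0]

# Made-up scores for three candidate next tokens.
tokens = ["cat", "dog", "fish"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
sharp = softmax(logits, temperature=0.5)  # more deterministic
flat = softmax(logits, temperature=2.0)   # more varied

rng = random.Random(0)
draws = [sample(tokens, probs, rng) for _ in range(1000)]
print(greedy(tokens, probs), {t: draws.count(t) for t in tokens})
```

The model's function is identical in all three cases; only the rule for picking from its output changes, which is why the sampling strategy can be swapped at inference time without retraining.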

frannyg 848 days ago

I'm still figuring out "inference time", but what left me puzzled at first was that there is, to humans at least, an infinite number of tokens that might come next (technical jargon, synonyms, lexical levels in general). So in my mind there was an RNG built into the function that, after "filtering" the weights based on the user request (and a lot of different tokens, even those meaning the same or almost the same, have the same weights), simply rolled the dice to produce the return string.

I thought the LLM was "getting to know the user" but it had a short memory span (the context) and thus "forgot" already calculated weights that it would use to (re)generate new weights.

Further down I learned it freaking forgets all the previous weights in general (I think that's what I learned, I'm getting there).

gremlinsinc 849 days ago

Could a training model be fed the raw data or source and weights of an LLM and create better-functioning LLMs by spotting patterns and things between models? Like, if you could feed it all the open source models, it could create sub-models off of those and maybe even a second-gen 'self' instance to better train on the second set, such that maybe it could find ways to get the same results with a 5B model as with a 75B one.

spywaregorilla 849 days ago

People take a model and continue training it all the time (that is, starting with the already derived weights of one model and doing more training on it to make it something different). Usually this is done to make the model more purpose-fit to a specific task, but it won't often make it generically better, assuming the first effort was using the model to its full potential (not "underfit").

The 75B param model simply has more complexity to work with than the 5B model.

In the same sense that: `y = mx + b` is just not as expressive as `y = ax^2 + bx + c`.
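A small sketch of that difference, assuming toy data drawn from y = x^2 (the helper names are made up for illustration). The best possible straight line still leaves a large error, while the quadratic form can match the data exactly.

```python
def fit_line(points):
    """Closed-form least-squares fit of y = m*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    m = cov_xy / var_x
    return m, mean_y - m * mean_x

def mse(predict, points):
    """Mean squared error of a prediction function over the data."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)

# Toy data generated from y = x^2 (an illustrative choice).
points = [(x, x * x) for x in range(-3, 4)]

# Best the two-parameter model can do: a flat line through the mean.
m, b = fit_line(points)
line_error = mse(lambda x: m * x + b, points)

# The three-parameter model with a=1, b=0, c=0 reproduces the data exactly.
quad_error = mse(lambda x: 1 * x**2 + 0 * x + 0, points)

print(f"line: y = {m:.1f}x + {b:.1f}, error = {line_error:.1f}")
print(f"quadratic: error = {quad_error:.1f}")
```

No amount of extra fitting effort closes that gap for the line; the only fix is a function with more parameters.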

gremlinsinc 848 days ago

Well, I was thinking more like..... something that could spit out an Android app because its source is 5k Android apps' binary/hex code, i.e. it goes off internals; basically it's a model of models. So it could find some common ground between all models, and create a new model that's the best of all of them. Then add itself to that list of models, and start up the next generation to do it all over again, including itself, and keep repeating until it can't get any better maybe, or until it finds a new way of doing training, or something. I guess I'm looking for a way to speed up the AI singularity, when AI can build upon itself, or really learn like a human, as in receive new input that gets added to the whole of the thing in real time.

spywaregorilla 847 days ago

That's mostly a shortcut to making the model worse rather than better because it'll just continually get more obsessive having learned about its own biases.

It's viable if you have tools or humans in the loop to comment on them and add new insights.

But the speed isn't really a factor here, and seeing 1000 new apps isn't obviously going to make it better if the model is already at the limits of what it can represent with its parameter count and compression so to speak.

lolinder 849 days ago

I could imagine something like that working in theory, but the number of examples you would need to train such a model makes it completely impractical. We tend to need billions of examples to get a modern deep learning model working well, and it will be a very long time before we reach that many examples of good LLMs.

janalsncm 849 days ago

In a way this is already how the model is trained. Model makes a prediction, loss function calculates how “wrong” the prediction was, and we update the weights of the model to minimize the loss.
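A minimal sketch of that predict, measure, update loop, assuming a two-parameter line in place of a real network and toy data from y = 2x - 1. The learning rate and epoch count are arbitrary choices, and real training uses automatic differentiation over billions of weights rather than hand-written gradients.

```python
def train(data, learning_rate=0.01, epochs=2000):
    """Gradient descent on mean squared error for the model y = m*x + b."""
    m, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_m = 0.0
        grad_b = 0.0
        for x, y in data:
            error = (m * x + b) - y      # how "wrong" the prediction was
            grad_m += 2 * error * x / n  # d(loss)/dm
            grad_b += 2 * error / n      # d(loss)/db
        # Move each weight a small step in the direction that lowers the loss.
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Toy data generated from y = 2x - 1 (an illustrative choice).
data = [(x, 2 * x - 1) for x in range(-5, 6)]
m, b = train(data)
print(f"learned m={m:.4f}, b={b:.4f}")
```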

teleforce 849 days ago

This is an excellent, concise description of how LLMs work, thanks.

rafaelero 849 days ago

You are incorrect. Increasing compute during inference renders similar gains to increasing parameters/compute during training time (see self-consistency, tree of thoughts, etc.)

Lerc 849 days ago

Can you elaborate upon that? Apart from the multiplication and accumulation of activations and weights, what additional computations can be applied to improve the outputs?

I think it has already been implied that we are not talking about increasing the quantity of parameters in this context, but the possibility of applying additional compute to a model with a given number of parameters.

rafaelero 849 days ago

You can train a smaller model and run inference multiple times and it will reach similar performance as a larger model running inference just once. What's the best way to make use of those multiple inferences is still up for debate, but we already know it works (self-consistency is one example).
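A toy sketch of the self-consistency idea: sample several answers and keep the most common one. The `noisy_model` function is a hypothetical stand-in, not a real LLM. It assumes each sample is right 60% of the time and that wrong answers are independent and scattered, which real models may not satisfy.

```python
import random
from collections import Counter

WRONG_ANSWERS = ["41", "43", "24", "420"]  # made-up wrong answers

def noisy_model(rng, correct="42", p_correct=0.6):
    """Hypothetical stand-in for the final answer of one sampled reasoning path."""
    if rng.random() < p_correct:
        return correct
    return rng.choice(WRONG_ANSWERS)

def self_consistent_answer(rng, n_samples=15):
    """Run inference several times and return the most common answer."""
    answers = [noisy_model(rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

rng = random.Random(0)
trials = 2000
single_acc = sum(noisy_model(rng) == "42" for _ in range(trials)) / trials
voted_acc = sum(self_consistent_answer(rng) == "42" for _ in range(trials)) / trials
print(f"single sample: {single_acc:.2f}, majority of 15: {voted_acc:.2f}")
```

Under these assumptions the voted answer is right far more often than a single sample, at 15 times the inference cost. If the model's errors were correlated (the same wrong answer every time), voting would help much less.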

frannyg 848 days ago

I wasn't able to elaborate on what I meant by "better" when I asked the question, but the idea can indeed be summarized as "will an LLM increase the quantity and quality of its parameters if you give it more processing power and time". Now I know that language models don't do that at all, and that the weights of the user request stored in the "frozen" training data are what assemble the return after generating possible output strings, which are selected by pre-prompts like asking for chain of thought and reasoning paths and so on, which in the end are nothing more than more weights pulling in more specific context. (I'm just thinking out loud here.)

frannyg 848 days ago

Yeah, I totally forgot about training time and time of request (aaah, inference time! now I get it.) being completely different points in time because the LLM has no access to the training data anymore.

frannyg 848 days ago

Right on. A total misconception on my part. And your answer was a nice primer before diving into the rest of the comments. Thanks!