by lolinder
849 days ago
There's a misconception in the question that is important to address first: when an LLM is running inference it isn't querying its training data at all, it's just using a function that we created previously (the "model") to predict the next word in a block of text. That's it.

When considering plain inference (no web search or document lookup), the decisions that determine a model's speed and capabilities come before the inference step, during the creation of the model. Building an LLM consists of defining its "architecture" (an enormous mathematical function that defines the model's shape) and then using a lot of trial and error to guess which "parameters" (constants that we plug into the function, like 'm' and 'b' in y=mx+b) will be most likely to produce text that resembles the training data.

So, to your question: LLMs tend to perform better the more parameters they have, so larger models will tend to beat smaller models. Larger models also require more processing power and/or time per inferred token, so we do tend to see that better models take more processing power. But this is because larger models tend to be better, not because throwing more compute at an existing model helps it produce better results.
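To make the y=mx+b analogy concrete, here is a minimal sketch in Python. It is a toy, not how a real LLM is built: the `model` and `train` names are made up for illustration, the training data is synthetic, and plain gradient descent stands in for the "trial and error" step. The point it shows is that once `m` and `b` are found, prediction needs only those two numbers, not the data they were fitted to.

```python
import random

def model(x, m, b):
    """The 'architecture': a fixed function shape with two parameters."""
    return m * x + b

def train(data, steps=5000, lr=0.01):
    """Repeatedly nudge m and b to reduce squared error on the data.

    Gradient descent is an assumption here, standing in for the
    "trial and error" search over parameters.
    """
    m, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_m = sum(2 * (model(x, m, b) - y) * x for x, y in data) / n
        grad_b = sum(2 * (model(x, m, b) - y) for x, y in data) / n
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b

# Synthetic training data (hypothetical): points near y = 3x + 1.
random.seed(0)
training_data = [(x, 3.0 * x + 1.0 + random.uniform(-0.1, 0.1)) for x in range(10)]

m, b = train(training_data)

# "Inference": the training data is gone, only the parameters remain.
del training_data
prediction = model(20, m, b)
print(f"m={m:.2f} b={b:.2f} prediction for x=20: {prediction:.1f}")
```

Running `model(20, m, b)` a thousand more times, or on a faster machine, gives the same answer every time. The only way to get a better prediction is to go back and train a different model.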
* LLMs are lossy compression functions on their training data.
* The size of the model dictates how lossy the compression is.
* You can't spend compute to get more detail out of a model once it's been compressed/trained, any more than you can spend compute to get an incredibly lossily-compressed movie to go from 240p back to the original 1080p source.
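The compression points above can be illustrated with a toy sketch. This is only an analogy in code, assuming a made-up block-averaging "codec" (`compress`, `decompress`, and `total_error` are hypothetical helpers, and the sample lists are invented): keeping fewer numbers loses more detail, and two different originals can compress to the same stored values, so no amount of work at decompression time can tell which one you started with.

```python
def compress(signal, factor):
    """Lossy: keep one average per block of `factor` samples.

    Assumes len(signal) is a multiple of `factor`.
    """
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal), factor)]

def decompress(compressed, factor):
    """Best effort: repeat each stored average `factor` times."""
    return [value for value in compressed for _ in range(factor)]

def total_error(signal, factor):
    """Sum of absolute differences after a compress/decompress round trip."""
    restored = decompress(compress(signal, factor), factor)
    return sum(abs(a - b) for a, b in zip(signal, restored))

original_a = [1, 3, 5, 7, 2, 2, 8, 0]
original_b = [2, 2, 6, 6, 1, 3, 4, 4]

# Two different originals collapse to the same compressed form,
# so the detail that distinguished them is gone for good.
print(compress(original_a, 2) == compress(original_b, 2))

# Storing fewer numbers (a "smaller model") loses more detail.
print(total_error(original_a, 2), total_error(original_a, 4))
```

The only way to reduce the error is to store more numbers in the first place, which is the analogue of training a larger model.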