| Why is it that the larger models are better at understanding and following more and more complex instructions. And generally just smarter? With DeepSeek we can now run on non-GPU servers with a lot of RAM. But surely quite a lot of the 671 GB or whatever is knowledge that is usually irrelevant? I guess what I sort of am thinking of is something like a model that comes with its own built in vector db and search as part of every inference cycle or something. But I know that there is something about the larger models that is required for really intelligent responses. Or at least that is what it seems because smaller models are just not as smart. If we could figure out how to change it so that you would rarely need to update the background knowledge during inference and most of that could live on disk, that would make this dramatically more economical. Maybe a model could have retrieval built in, and trained on reducing the number of retrievals the longer the context is. Or something. |
Small correction - it's 671B parameters, not 671 gigabytes. Doing some rudimentary math, if you want to run the entire model in memory it would take ~750 GiB, or about 805 GB (671B params * 1 byte per param at FP8 * 1.2 for 20% overhead = 805.2 GB = 749.901 GiB).
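The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a rough sketch: the 20% overhead factor is the same guess used in the estimate, and the function name `model_memory_bytes` is made up for illustration.

```python
def model_memory_bytes(n_params: float, bits_per_param: int = 8, overhead: float = 1.2) -> float:
    """Rough estimate: parameters * bytes per parameter * overhead factor."""
    return n_params * (bits_per_param / 8) * overhead

total = model_memory_bytes(671e9)   # 671B parameters stored as FP8 (1 byte each)
print(f"{total / 1e9:.1f} GB")      # decimal gigabytes -> 805.2 GB
print(f"{total / 2**30:.1f} GiB")   # binary gibibytes  -> 749.9 GiB
```

The same function with `bits_per_param=16` gives the footprint for BF16/FP16 weights, which is twice as large.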
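It's an MoE (mixture-of-experts) model, so only a subset of the experts (roughly 37B of the 671B parameters) is active for any given token, and you don't actually need to touch all of it at once.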
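To make the MoE point concrete, here is a toy sketch of top-k expert routing. It is a generic softmax-over-top-k router, not DeepSeek's actual routing code, and the sizes (8 experts, 2 active) and the name `top_k_route` are assumptions chosen for illustration.

```python
import math
import random

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Subtract the max logit before exponentiating for numerical stability.
    exps = [math.exp(router_logits[i] - router_logits[top[0]]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

random.seed(0)
n_experts, k = 8, 2                       # toy sizes, not DeepSeek's real configuration
logits = [random.gauss(0, 1) for _ in range(n_experts)]
chosen = top_k_route(logits, k)           # list of (expert_index, mixing_weight)
print(chosen)                             # only these experts run for this token
print(f"fraction of experts used: {k / n_experts:.0%}")
```

Note that different tokens pick different experts, so all of the experts still have to be stored somewhere reachable; the saving is in how much is read and computed per token.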
I think maybe what you are asking is "Why do more params make a better model?"
Generally speaking, it's because if you have more units of representation (params), you can encode more information about the relationships in the data used to train the model.
Think of it like building a LEGO city.
A model with fewer parameters is like having a small LEGO set with fewer blocks. You can still build something cool, like a little house or a car, but you're limited in how detailed or complex it can be.
A model with more parameters is like having a giant LEGO set with thousands of pieces in all shapes and colours. Now, you can build an entire city with skyscrapers, parks, and detailed streets.
---
In terms of "is a lot of it irrelevant?" - This is a hot area of research!
It's currently very difficult to know which parameters are relevant and which aren't. There is an area of research called mechanistic interpretability that aims to illuminate this. If you are interested, Anthropic released a good paper on this, "Scaling Monosemanticity", along with a demo called "Golden Gate Claude".