Hacker News
by AJRF 501 days ago
> But surely quite a lot of the 671 GB or whatever is knowledge that is usually irrelevant?

Small correction: it's 671B parameters, not 671 gigabytes. Doing some rudimentary math, holding the entire model in memory would take roughly 750 GiB: 671B params × 1 byte each (FP8 = 8 bits) × 1.2 (20% overhead) ≈ 805.2 GB ≈ 749.901 GiB.
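If you want to redo that estimate for other precisions, here is a minimal sketch of the arithmetic. The function name and the 20% overhead default are just my assumptions carried over from the rough math above, not a measured figure for any particular runtime.

```python
def weight_memory_bytes(n_params, bits_per_param=8, overhead=0.20):
    """Estimate bytes needed to hold all weights, plus a fractional overhead.

    The overhead fraction is a rough assumption, not a measured value.
    """
    return n_params * (bits_per_param / 8) * (1 + overhead)

GB = 10**9   # decimal gigabyte
GIB = 2**30  # binary gibibyte

total = weight_memory_bytes(671e9)  # 671B params at FP8
print(f"{total / GB:.1f} GB")       # 805.2 GB
print(f"{total / GIB:.3f} GiB")     # 749.901 GiB
```

Passing `bits_per_param=16` doubles the estimate, and `bits_per_param=4` halves it.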

It's a MoE model, so only a fraction of the parameters is active for any given token, and you don't actually need all ~750 GiB loaded at once.
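To make the MoE point concrete, here is a toy sketch of top-k routing, where only the selected experts run for a given input. Everything in it is hypothetical: the expert count, `TOP_K`, the random gate scores, and the "experts" themselves (which just scale their input) are stand-ins, and this is not the routing scheme of the model under discussion.

```python
import math
import random

def top_k_route(gate_scores, k):
    """Return indices of the k highest-scoring experts and their softmax weights."""
    chosen = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)[:k]
    exps = [math.exp(gate_scores[i]) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]

def moe_forward(x, experts, gate_scores, k):
    """Run only the selected experts on x and mix their outputs by gate weight."""
    chosen, weights = top_k_route(gate_scores, k)
    return sum(w * experts[i](x) for i, w in zip(chosen, weights)), chosen

# Toy setup: 8 "experts" that just scale the input (real experts are feed-forward blocks).
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]

random.seed(0)
# A real router computes these scores from the token; random values stand in here.
gate_scores = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]

y, used = moe_forward(1.0, experts, gate_scores, TOP_K)
print(f"experts used: {sorted(used)} of {NUM_EXPERTS}")
print(f"fraction of experts touched: {TOP_K / NUM_EXPERTS:.0%}")
```

Note that sparse activation cuts the compute per token. Whether you can avoid keeping every expert in fast memory depends on how the runtime offloads weights, since a different token may route to different experts.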

I think maybe what you are asking is "Why do more params make a better model?"

Generally speaking, it's because if you have more units of representation (params), you can encode more information about the relationships in the data used to train the model.

Think of it like building a LEGO city.

A model with fewer parameters is like having a small LEGO set with fewer blocks. You can still build something cool, like a little house or a car, but you're limited in how detailed or complex it can be.

A model with more parameters is like having a giant LEGO set with thousands of pieces in all shapes and colours. Now, you can build an entire city with skyscrapers, parks, and detailed streets.

---

In terms of "is a lot of it irrelevant?": this is a hot area of research!

It's currently very difficult to know which parameters are relevant and which aren't. There is an area of research called mechanistic interpretability that aims to illuminate this. If you are interested, Anthropic published a good piece on this called "Golden Gate Claude".

2 comments

To extend your Lego metaphor to the question of “is a lot of this irrelevant?”: does your Lego model of a city need to have interior floors, furniture, and fixtures in order to satisfy your requirements? Perhaps in some cases, but not in most.

I know it's a MoE and I didn't need the five-year-old explanation of why larger models are smarter. I'm also aware of interpretability research. You should read my question much more carefully and think about it harder.

What you described towards the end of your original comment is conceptually an MoE-like architecture. But you explicitly mentioned not understanding why larger models are smarter and then proceeded to try to couch-engineer a new, smaller architecture, which suggests that you were in fact not aware of the MoE architecture, and thus the ELI5 LEGO approach was reasonably helpful. I’ve read your question carefully many times, and I’ve read others’ comments in the thread. You seem frustrated that folks aren’t answering your questions when in fact they have been answered, albeit not in the way you seem to want. How can we fix this?