by ilaksh 547 days ago

Why is it that larger models are better at understanding and following increasingly complex instructions, and generally just smarter?

With DeepSeek we can now run on non-GPU servers with a lot of RAM. But surely quite a lot of the 671 GB or whatever is knowledge that is usually irrelevant?

I guess what I am thinking of is something like a model that comes with its own built-in vector DB and search as part of every inference cycle.

But I know that there is something about the larger models that is required for really intelligent responses. Or at least that is what it seems because smaller models are just not as smart.

If we could figure out how to change it so that you would rarely need to update the background knowledge during inference and most of that could live on disk, that would make this dramatically more economical.

Maybe a model could have retrieval built in, and trained on reducing the number of retrievals the longer the context is. Or something.
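To make the idea above concrete, here is a toy sketch of a retrieval step that is skipped whenever something already in the context covers the query. Everything in it is an assumption for illustration: the `STORE` contents, the hand-written 3-d vectors standing in for learned embeddings, the `step` function, and the 0.9 threshold. It is not how DeepSeek or any existing model works.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical "on-disk" background knowledge: (embedding, fact) pairs.
STORE = [
    ([1.0, 0.0, 0.0], "fact about topic A"),
    ([0.0, 1.0, 0.0], "fact about topic B"),
    ([0.0, 0.0, 1.0], "fact about topic C"),
]

def step(query_vec, context, threshold=0.9):
    """Return (fact, retrieved). Hits the store only on a context miss."""
    for emb, fact in context:
        if cosine(query_vec, emb) >= threshold:
            return fact, False  # already in context, no retrieval needed
    emb, fact = max(STORE, key=lambda item: cosine(query_vec, item[0]))
    context.append((emb, fact))  # keep it around for later steps
    return fact, True

context = []
first = step([0.9, 0.1, 0.0], context)    # miss: fetched from the store
second = step([1.0, 0.05, 0.0], context)  # hit: served from context
third = step([0.0, 1.0, 0.0], context)    # miss: new topic
```

In this sketch the number of store lookups drops as the context fills up, which is the behaviour the post speculates a model could be trained toward.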

6 comments

AJRF 547 days ago

> But surely quite a lot of the 671 GB or whatever is knowledge that is usually irrelevant?

Small correction: it's 671B parameters, not 671 gigabytes. Doing some rudimentary math, if you want to run the entire model in memory it would take roughly 750 GiB: 671B params * 1 byte (FP8 == 8 bits) * 1.2 (20% overhead) = ~805 GB = 749.901 GiB.

It's an MoE model, so you don't actually need to load all ~750 GiB at once.
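The memory estimate above can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the function name is made up for illustration, and the 20% overhead is the comment's rule of thumb, not a measured figure.

```python
def model_memory_bytes(n_params, bits_per_param=8, overhead=0.20):
    """Rough memory footprint: parameters * bytes each * (1 + overhead)."""
    return n_params * (bits_per_param / 8) * (1 + overhead)

total = model_memory_bytes(671e9)   # 671B parameters stored as FP8
gb = total / 1e9                    # decimal gigabytes
gib = total / 2**30                 # binary gibibytes

print(f"{gb:.1f} GB, {gib:.1f} GiB")  # prints "805.2 GB, 749.9 GiB"
```

The gap between 805 and 750 is only the GB versus GiB unit difference; both describe the same number of bytes.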

I think maybe what you are asking is "Why do more params make a better model?"

Generally speaking, it's because if you have more units of representation (params), you can encode more information about the relationships in the data used to train the model.

Think of it like building a LEGO city.

A model with fewer parameters is like having a small LEGO set with fewer blocks. You can still build something cool, like a little house or a car, but you're limited in how detailed or complex it can be.

A model with more parameters is like having a giant LEGO set with thousands of pieces in all shapes and colours. Now, you can build an entire city with skyscrapers, parks, and detailed streets.

---

In terms of "is a lot of it irrelevant?" - this is a hot area of research!

It's currently very difficult to know which parameters are relevant and which aren't. There is an area of research called mechanistic interpretability that aims to illuminate this. If you are interested, Anthropic released good work on this, best known for its "Golden Gate Claude" demo.

reagle 546 days ago

To extend your Lego metaphor to the question of “is a lot of this irrelevant?”: does your Lego model of a city need to have interior floors, furniture, and fixtures in order to satisfy your requirements? Perhaps in some cases, but not in most.

ilaksh 547 days ago

I know it's an MoE and I didn't need the five-year-old explanation of why larger models are smarter. I'm also aware of interpretability research. You should read my question much more carefully and think about it harder.

tharant 544 days ago

Although what you described towards the end of your original comment conceptually resembles an MoE-like architecture, the fact that you explicitly mentioned not understanding why larger models are smarter, and then proceeded to couch-engineer a new, smaller architecture, suggests that you were in fact not aware of the MoE architecture, and thus the ELI5 LEGO approach was reasonably helpful. I’ve read your question carefully many times, and I’ve read others’ comments in the thread. You seem frustrated that folks aren’t answering your questions when in fact they have been answered, albeit not in the way you seem to want. How can we fix this?

zamadatix 547 days ago

This is, more or less, what the mixture-of-experts (MoE) approach is picking away at. The difference is that rather than trying to break it out by how rare or common the info is, it's broken out by specialization. There isn't as much focus on keeping the inactive portions on disk, because it's more economical to host it all in a way that lets you parallelize requests across the experts. This has the added effect that you can constantly select the best expert as the answer is generated without losing efficient hosting.
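As a rough illustration of the expert selection described above, here is a minimal top-k gating sketch in plain Python. The four toy experts, the gate logits, and the function names are assumptions made up for the example. In a real MoE model the router is learned and runs per token and per layer, and this is generic top-k gating rather than DeepSeek's specific routing scheme.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(gate_logits, k=2):
    """Pick the top-k experts and renormalize their weights to sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

# Hypothetical experts: trivial scalar functions standing in for sub-networks.
EXPERTS = [lambda x: x + 1, lambda x: x * 2, lambda x: x ** 2, lambda x: -x]

def moe_forward(x, gate_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(w * EXPERTS[i](x) for i, w in route(gate_logits, k))

logits = [0.0, 5.0, 1.0, 5.0]   # made-up router scores for one token
chosen = route(logits)           # experts 1 and 3, weight 0.5 each
output = moe_forward(3.0, logits)
```

Only two of the four experts are evaluated for this input, but a different input can select a different pair, which is why all experts are usually kept loaded rather than parked on disk.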

ilaksh 547 days ago

I know what MoE is. Maybe read my comments more carefully and give me the benefit of the doubt.

zamadatix 547 days ago

My comment would've done an astoundingly bad job at introducing you to what mixture of experts is, had that been its goal. It's really about why the MoE-style enhancements don't target how to keep parts on disk when optimizing the model to be most economical to host. There's really not any doubt in that, it's just an observation as to why they optimize the way they do.

If you were put off by defining terms on first use: that's just good form, not something related to you.

joshuakogut 547 days ago

Yesterday when I started evaluating Deepseek-R1 V3, it was insanely better at code generation using elaborate prompts. I asked it to write me some boilerplate code in Python using the ebaysdk library to pull a list of all products sold by user with $name, and it spit it out; just a few tweaks and it was ready to go.

I tried the same thing on the 7B and 32B models today; neither is as effective as codellama.

ilaksh 547 days ago

I think people didn't understand my comment. I am very aware of this already.

seba_dos1 546 days ago

I think you failed to convey what you meant to with your comment.

If you want your contribution to the discussion to be meaningful, you may want to give it another go.

A4ET8a8uTh0_v2 547 days ago

I am intrigued. What did you use to run your DeepSeek instance?

lossolo 546 days ago

You would need to extract logical patterns and concepts somehow, not just word relationships. I know what you mean, this introduces another level of abstraction between relationships. If there is no way to extract these patterns, or if there are no real logical patterns present but only statistical relationships (larger model = more relationships = better prompt following etc) between words without any real 'emergent abilities' then Transformers are essentially a dead end in the context of AGI.

HarHarVeryFunny 547 days ago

I'm sure that a smaller generalist model with RAG would work for many cases, especially where the RAG is just looking up some facts or technique, but would you really want a smart high school kid who's googled brain surgery to be operating on your brain? Books are useful for looking up facts, but there's no substitute for experience/training in actually getting good at something.

doubleyou 546 days ago

If you google LLM, you'll see the first L stands for large.
