| HN Mirror


	by datadrivenangel 35 days ago
	In my experience, once you get to ~30 GB of RAM for a model like Gemma4, the rest of the 128 GB of memory is simply nice to have. The speed and cost are what make it tough, though: it's slower and more expensive than the same model served on a big accelerator card, and it's going to be worse than a frontier model.

2 comments

digitaltrees 35 days ago

I wonder if it really needs to be worse. I am playing with the idea of fine-tuning a model on my exact stack and coding patterns. I suspect I could get better performance by training “taste” into a model rather than breadth.

epicureanideal 35 days ago

I also wonder about JS-only, Python-only, etc. models.

Maybe the future is a selection of local, specific stack trained models?

robrenaud 34 days ago

There is some recent work on modularizing knowledge in LLMs.

https://arxiv.org/html/2605.06663v1

It might be possible to train a big generalist that is a composition of modules, some of which can be dropped dynamically at inference time, depending on the prompt.

digitaltrees 29 days ago

Cool. Thanks for sharing. I am thinking about creating a series of smaller models for specific purposes and then orchestrating them so they mirror the human brain, which is a bunch of subsystems that give multiple opinions about the same stimulus.

shailendra_sis 29 days ago

Interesting direction. I’ve also been thinking about modular / subsystem-based approaches for specialized tasks in small AI systems.

andy_ppp 35 days ago

These models' ability to generalise at coding will likely get worse if you remove high-quality training data like all of Python.

jimbokun 34 days ago

That approach has its advantages, but sometimes I want to generate code, using the accepted best practices, for a language or kind of project I’m not experienced with.

andy_ppp 35 days ago

Fine-tuning these models (at least with PPO or equivalent) requires even more VRAM than inference does, potentially 2-3 times more.

rusk 34 days ago

You could use PEFT? Training only a small subset of weights is fairly standard practice nowadays …

andy_ppp 34 days ago

Yes, I used LoRA and it’s fine, but I’m not convinced the model doesn’t end up more stupid and less general.
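
For a sense of why LoRA is lighter on memory, here is a rough sketch of the parameter arithmetic. It relies on the standard LoRA setup, where a frozen weight matrix of shape d_out x d_in gets a trainable low-rank update built from two matrices of shapes d_out x r and r x d_in. The 4096 x 4096 layer size and rank 16 are illustrative assumptions, not figures from this thread, and the sketch says nothing about whether quality or generality suffers.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in a rank-`rank` LoRA adapter for one matrix.

    The adapter is two matrices: (d_out x rank) and (rank x d_in).
    """
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 16
full = full_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
print(f"full: {full:,} params, LoRA: {lora:,} params ({lora / full:.2%} of full)")
```

Gradients and optimizer state are only needed for the trainable parameters, which is where most of the savings come from; the frozen base weights still have to fit in memory.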

ElectricalUnion 33 days ago

You need the rest of the RAM for the context. If you don't want to end up with a toy context or a lossy quantized one, it's pretty easy to end up spending 50+ GB just on the KV cache, per simultaneous inference slot.
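
As a back-of-the-envelope check on KV cache size, the usual estimate is 2 (keys and values) x layers x KV heads x head dimension x context length x bytes per value. The sketch below assumes every layer uses full attention and an unquantized 16-bit cache. The model shape (48 layers, 8 KV heads, head dimension 128) is a hypothetical example, not the configuration of any model named in this thread.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one inference slot.

    The leading factor of 2 accounts for storing both keys and values.
    Assumes full attention in every layer (no sliding window) and no
    cache quantization beyond the chosen bytes_per_value.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical model shape, for illustration only.
hypothetical = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for ctx in (8_192, 131_072):
    gib = kv_cache_bytes(context_tokens=ctx, **hypothetical) / 2**30
    print(f"{ctx:>7} tokens: {gib:.1f} GiB per slot")
```

The cost scales linearly with context length and with the number of simultaneous slots, so a long context served to a few parallel requests adds up quickly.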
