Hacker News
by datadrivenangel 35 days ago
In my experience, once you get to ~30 GB of RAM for a model like Gemma4, the rest of the 128 GB of memory is simply nice to have. The speed and cost are what make it tough, though: it's slower and more expensive than the same model served on a big accelerator card, and it's going to be worse than a frontier model.

I wonder if it really needs to be worse. I am playing with the idea of fine tuning a model on my exact stack and coding patterns. I suspect I could get better performance by training “taste” into a model rather than breadth.
I also wonder about JS-only, Python-only, etc. models.

Maybe the future is a selection of local, specific stack trained models?

There is some recent work on modularizing knowledge in LLMs.

https://arxiv.org/html/2605.06663v1

It might be possible to train a big generalist that is a composition of modules, some of which can be dropped dynamically at inference time, depending on the prompt.

Cool. Thanks for sharing. I am thinking about creating a series of smaller models for specific purposes and then orchestrating them so they mirror the human brain, which is a bunch of subsystems that give multiple opinions about the same stimulus.
Interesting direction. I’ve also been thinking about modular / subsystem-based approaches for specialized tasks in small AI systems.
These models' ability to generalise at coding will likely get worse if you remove high-quality training data like all of Python.
That approach has its advantages, but sometimes I want to generate code, following the accepted best practices, for a language or kind of project I’m not experienced with.
Fine tuning these models (at least with PPO or equivalent) requires even more VRAM than inference does, potentially 2-3 times more.
You could use PEFT? Operating on only a subset of weights is fairly standard practice nowadays …
Yes, I used LoRA and it’s fine, but I’m not convinced the model doesn’t end up more stupid and less general.
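For a rough sense of why PEFT/LoRA cuts the training footprint: LoRA freezes a weight matrix W (d_out x d_in) and trains two small matrices B (d_out x r) and A (r x d_in) alongside it, so the trainable count per adapted matrix is r * (d_in + d_out). A minimal sketch of that arithmetic, where the 4096 width and rank 16 are made-up illustrative numbers, not taken from any particular model:

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in one dense weight matrix W of shape (d_out, d_in)."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for that matrix: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical sizes, chosen only for illustration.
d = 4096
r = 16

full = full_params(d, d)               # 16,777,216
lora = lora_trainable_params(d, d, r)  # 131,072
print(f"trainable fraction per adapted matrix: {lora / full:.2%}")  # 0.78%
```

This only counts trainable weights (and therefore gradients and optimizer state for them). Activations and the frozen base weights still have to fit, and it says nothing about whether the tuned model ends up less general.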
You need the rest of the RAM for the context. If you don't want to end up with a toy context or a lossy quantized context, it's pretty easy to end up spending 50+ GB just for the KV cache, per simultaneous inference slot.
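Back-of-the-envelope version of that KV cache claim: with plain full attention, each token stores one key and one value vector per layer per KV head, so the cache is 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value. A quick sketch, assuming hypothetical model dimensions (48 layers, 16 KV heads, head dim 128, 128K context, 16-bit cache), which are not the real numbers for Gemma or any specific model:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size for ONE inference slot, assuming full attention in every layer."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical dimensions, for illustration only.
size = kv_cache_bytes(n_layers=48, n_kv_heads=16, head_dim=128,
                      context_tokens=131072, bytes_per_value=2)
print(f"{size / 2**30:.1f} GiB per slot")  # 48.0 GiB per slot
```

Models that use sliding-window or other reduced-attention layers, fewer KV heads, or a quantized cache will come in lower, and the total scales linearly with the number of simultaneous slots.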