| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by internetguy 73 days ago

> MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state

This is pretty awesome. The only compute I have at home is an RTX 3080 with 10 GB of VRAM, so I struggle with training larger models (>40M, 50M params). I get OOM errors and have to optimize a lot.

I have a lot more CPU RAM in my PC, and this would likely increase the size of models I can train locally.

5 comments

weitendorf 73 days ago

To make the most of these architectures I think the key is essentially moving more of the knowledge/capabilities out of the "weights" and into the complimentary parts of the system in a way that's proportionate to the capabilities of the hardware.

In the past couple months there's been a kind of explosion in small-models that are occupying a niche in this kind of AI-transcoding space. What I'm hoping we're right on the cusp of achieving is a similar explosion in what I'd call tool-adaptation, where an LLM paired with some mostly-fixed suite of tools and problem cases can trade off some generality for a specialized (potentially hyper-specialized to the company or user) role.

The thing about more transcoding-related tasks is that they in general stay in sync with what the user of the device is actively doing, which will also typically be closely aligned with the capabilities of the user's hardware and what they want to do with their computer. So most people aren't being intentional about this kind of stuff right now, partly out of habit I think, because only just now does it make sense to think of personal computer as "stranded hardware" now that they can be steered/programmed somewhat autonomously.

I'm wondering if with the right approach to MoE on local devices (which local llms are heading towards) we could basically amortize the expensive hit from loading weights in and out of VRAM through some kind of extreme batch use case that users still find useful enough to be worth the latency. LoRa is already really useful for this but obviously sometimes you need more expertise/specialization than just a few layers' difference. Experimenting with this right now. It's the same basic principle as in the paper except less of a technical optimization and more workload optimization. Also it's literally the beginning of machine culture so that's kind of cool

HarHarVeryFunny 72 days ago

> To make the most of these architectures I think the key is essentially moving more of the knowledge/capabilities out of the "weights" and into the complimentary parts of the system in a way that's proportionate to the capabilities of the hardware

I think that's only possible to limited extent. Learnt skills (RL in context of an LLM?) need to be in the weights of the model since this reflects the model's "personalized" learning of the behavioral feedback loop. Declarative knowledge (facts) can be loaded at runtime (RAG).

orbisvicis 73 days ago

That's interesting. So you want to train language, linguistic reasoning, and tool use, but otherwise strip out all knowledge in lieu of a massive context? Just grade they model on how well it can access local information, perhaps also run tools?

spacebacon 73 days ago

You are on the right track. Check out the Semiotic-Reflexive Transformer (SRT) here.

https://open.substack.com/pub/sublius/p/the-semiotic-reflexi...

spacebacon 72 days ago

Empirically proved a new science. Still gets downvoted xD.

hirako2000 73 days ago

The claims of the article assumes far more compute and far more VRAM..while the trick enables less back and forth, they don't eliminate it.

I doubt you meant 50M. Rather 50B?

You can only give it a try, but don't get your hopes high on a large context. If their technique works I would guess 8096k context limits would still OOM. 2048 maybe.

I'm extrapolating based on my experiment without this paper's trick to leverage the system memory.

kouteiheika 73 days ago

> You can only give it a try, but don't get your hopes high on a large context.

You may or may not know this, but: when training off-the-shelf LLMs (i.e. ones which have a huge vocabulary) what consumes a huge amount of memory usage is calculating the cross-entropy loss (which gets worse the more tokens you stuff in your batch), so always use a fused cross-entropy kernel.

For example, for a Gemma 2 model with 2B parameters at a batch size of 8k this consumes 24GB of VRAM by default (!); you can fuse your cross-entropy loss with @torch.compile and that can cut down this memory usage to something like a few gigabytes, but with a dedicated kernel this becomes a few megabytes.

gavinray 73 days ago

I'd not heard of this before, quick search turned up this 2025 post which suggests "fused cross-entropy loss" kernel was integrated into PyTorch:

https://pytorch.org/blog/peak-performance-minimized-memory/

  > "The integration involves modifying the TransformerDecoder module in torchtune to bypass the linear layer computation, allowing the Liger Fused Linear Cross Entropy Loss to handle the forward projection weights. "

Is this the same thing as you discuss above?

kouteiheika 73 days ago

Yes.

Although this wasn't integrated into PyTorch itself (but to torchtune, which is a different thing). If you're writing your own training loop you need to use a third-party kernel, e.g. the Liger kernel mentioned in the article, or Cut Cross Entropy (which is much better than the Liger one, although IIRC it has a numeric bug in one of its kernels making the results very slightly off).

hirako2000 73 days ago

Activation would still require gigabytes for a few kb context.

There are plenty of techniques to optimise. But the question is what can an rtx 3080 train before OOM. The answer is not that much.

Can barely do quantized fine tuning. Even then, small context.

kouteiheika 73 days ago

> Activation would still require gigabytes for a few kb context.

For that you use activation checkpointing, and you can also offload that to the CPU in a smart way to hide the latency. Although, yes, for long context training the activations do dominate the memory usage (and quantizing them degrades things more than just quantizing weights and/or optimizer states).

teaearlgraycold 73 days ago

Maybe they're not talking about LLMs? Training a 9M parameter YOLO can take over 20GB of VRAM if you use a large image size.

giancarlostoro 73 days ago

> This is pretty awesome. The only compute I have at home is an RTX 3080 with 10 GB of VRAM, so I struggle with training larger models (>40M, 50M params). I get OOM errors and have to optimize a lot.

I'm on the same GPU, its intimidating to me if I even want to bother training anything at all. Do you mind sharing what kind of training you've done with that GPU? :)

pixelsort 73 days ago

Make sure you are running adaptive cooling (or just bump them up) on the top fans in your case. Also, ensure that you are undervolting appropriately using something like MSI Afterburner or GreenWithEnvy.

If you don't, you could easily toast your RAM -- especially under BF16.

cyanydeez 73 days ago

Anything that can run on a AMD395+ w/128GB or whatever the apple equivelent would break things wide open. Training a model on my frameworks of choice or our business info would be awesome.

logicallee 73 days ago

Could I ask what you train your models to do? How do you generate the training data for it?