by anima-core 188 days ago

No, the compression result doesn't mean the original 64 GB model can run on a 292 MB card. The teacher model isn't the thing that gets compressed. It still needs to be loaded during training.

What gets small is the student: the tiny head trained on the teacher's first-layer fields. That head ends up a few MB because it's not a transformer at all. It's basically a lightweight function approximator that reproduces the teacher's behavior on the specific task it was trained for.
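To make the teacher/student split concrete, here is a toy sketch, not the actual method. Every name in it (`teacher_first_layer`, `teacher_output`, `head`) is a hypothetical stand-in, the "teacher" is just a small formula, and the head is a plain linear model fit by SGD. How the fields are obtained at inference time isn't covered here; the sketch simply takes them as input.

```python
import random

def teacher_first_layer(x):
    # Hypothetical stand-in for the teacher's first-layer output ("fields").
    return [x, x * x]

def teacher_output(fields):
    # Hypothetical stand-in for the rest of the large teacher model.
    return 3.0 * fields[0] - 2.0 * fields[1] + 1.0

def build_dataset(n=2000, seed=0):
    # Offline step: this is the only place the full teacher is needed.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        fields = teacher_first_layer(rng.uniform(-1.0, 1.0))
        data.append((fields, teacher_output(fields)))
    return data

def head(fields, w, b):
    # The student: a tiny linear function, not a transformer.
    return sum(wi * fi for wi, fi in zip(w, fields)) + b

def train_head(data, epochs=10, lr=0.05):
    # Plain SGD on squared error between the head and the teacher's output.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for fields, target in data:
            err = head(fields, w, b) - target
            w = [wi - lr * err * fi for wi, fi in zip(w, fields)]
            b -= lr * err
    return w, b

w, b = train_head(build_dataset())
# After training, only (w, b) are needed to evaluate the head.
print(head([0.5, 0.25], w, b), teacher_output([0.5, 0.25]))
```

The point is just the shape of it: the teacher only shows up in `build_dataset`, and what you keep afterwards is three floats.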

So training still requires the usual multi-GB footprint (which can be done offline). After training, inference with the student requires only the head. That's why inference is cheap, but you can't load the full teacher into 292 MB of VRAM.
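The footprint argument as back-of-the-envelope arithmetic. The 64 GB and 292 MB figures are the ones above; the 4 MB head size is an assumed placeholder for "a few MB", and the training figure ignores activations and optimizer state, so treat it as a lower bound.

```python
GB = 1024 ** 3
MB = 1024 ** 2

TEACHER_BYTES = 64 * GB   # teacher size quoted above
CARD_BYTES = 292 * MB     # VRAM budget quoted above
STUDENT_BYTES = 4 * MB    # assumption: stands in for "a few MB"

def fits(required_bytes, budget_bytes):
    # True when the required memory fits within the budget.
    return required_bytes <= budget_bytes

# Training needs the teacher and the student resident (lower bound).
training_bytes = TEACHER_BYTES + STUDENT_BYTES
# Inference needs only the student head.
inference_bytes = STUDENT_BYTES

print("training fits on card:", fits(training_bytes, CARD_BYTES))
print("inference fits on card:", fits(inference_bytes, CARD_BYTES))
```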