| HN Mirror



	by himata4113 50 days ago
	They're likely trained in the 2T to 3T parameter class. It's very unlikely that Chinese labs have access to GPU systems capable of training models like that, let alone serving them. This requires proprietary room-scale systems, which fetch a huge premium over typical 10 slot systems. I am sure that they can develop their own equivalent version of such clusters in around 1 year though. Distilling fabel 5 will also go a long way.

3 comments

logicprog 50 days ago

DSv4 is nearly in the 2T range, but yes, you're generally right.


himata4113 50 days ago

MoE experts were likely trained independently / in a sparse format. Training anything beyond 2T on typical systems would be infuriatingly slow. You could do 4T on Nvidia's room-scale solution, but for a reasonable training speed / batch size it caps out around 3T.
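
For a rough sense of scale, here is a back-of-the-envelope sketch of the memory needed just to hold training state at these sizes. It assumes the commonly cited 16 bytes per parameter for mixed-precision Adam (2 for bf16 weights, 2 for bf16 gradients, 12 for fp32 master weights plus the two Adam moments) and ignores activations entirely. The 80 GB device size and the function names are illustrative assumptions, not figures from this thread.

```python
# Rough training-state arithmetic. All constants are illustrative assumptions.

# 2 (bf16 weights) + 2 (bf16 grads) + 12 (fp32 master weights, Adam m, Adam v)
BYTES_PER_PARAM_MIXED_ADAM = 16


def training_state_bytes(n_params: int,
                         bytes_per_param: int = BYTES_PER_PARAM_MIXED_ADAM) -> int:
    """Bytes needed for weights, gradients and optimizer state (no activations)."""
    return n_params * bytes_per_param


def min_devices(n_params: int, mem_per_device_gb: int) -> int:
    """Lower bound on device count if the state is sharded perfectly evenly."""
    total = training_state_bytes(n_params)
    per_device = mem_per_device_gb * 10**9
    return -(-total // per_device)  # ceiling division on integers


if __name__ == "__main__":
    for n_params in (2 * 10**12, 3 * 10**12, 4 * 10**12):
        tb = training_state_bytes(n_params) / 10**12
        print(f"{n_params / 10**12:.0f}T params: {tb:.0f} TB of state, "
              f">= {min_devices(n_params, 80)} devices at 80 GB each")
```

This is only a floor on device count. It says nothing about interconnect bandwidth or batch size, which is what the speed argument above actually hinges on.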


sosodev 50 days ago

Do you have any resources to share regarding independent expert training? I was under the impression that it's not feasible.


himata4113 50 days ago

The concept is similar to how it works in inference: instead of performing regressive writes to the entire model, you run the whole model, but part of the model can live in system memory and get swapped in/out on demand. So only XB parameters are active in training.

Edit: I am not really sure if it works like that. I haven't looked too deeply into DeepSeek V4 Pro specifically.
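
To make the "only XB parameters are active" point concrete, here is a minimal sketch of how total and per-token active parameter counts differ in a top-k routed MoE. The layout (one shared block plus identically sized experts) and every number in the example are hypothetical and do not describe any specific model. It illustrates sparse activation only, and does not show that experts can be trained independently.

```python
# Total vs. per-token active parameters in a top-k MoE.
# Hypothetical layout: shared (attention, embeddings, router) + equal-size experts.

def moe_param_counts(shared: int, per_expert: int,
                     n_experts: int, top_k: int) -> tuple[int, int]:
    """Return (total_params, active_params_per_token)."""
    if not 0 < top_k <= n_experts:
        raise ValueError("top_k must be between 1 and n_experts")
    total = shared + per_expert * n_experts
    active = shared + per_expert * top_k  # only the routed experts run per token
    return total, active


if __name__ == "__main__":
    # Made-up configuration, chosen only to land near the sizes discussed above.
    total, active = moe_param_counts(
        shared=50 * 10**9, per_expert=7 * 10**9, n_experts=256, top_k=8
    )
    print(f"total:  {total / 10**9:.0f}B")
    print(f"active: {active / 10**9:.0f}B per token")
```

Note that even though each token touches only the active subset, every expert still needs its weights and optimizer state stored somewhere, so the total count is what drives the memory requirement.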


axpy906 49 days ago

We’ll see it distilled first.


OtomotO 50 days ago

Ah, American hubris ... I don't blame you, Hollywood is the world's greatest propaganda machine of all time.
