Hacker News new | ask | show | jobs
by YetAnotherNick 317 days ago
Deepseek main run costed $6M. qwen3-30b-a3b probably would cost few $100Ks, which is ranked 13th.

GPU cost of the final model training isn't the biggest chunk of the cost and you can probably replicate results of models like Llama 3 very cheaply. It's the cost of experiments, researchers, data collection which brings overall cost 1 or 2 order of magnitude higher.

1 comments

What's your source for any of that? I think the $6 million thing was identified as a lie they felt was necessary because of GPU export laws.
It wasn't a lie, it was a misrepresentation of the total cost. It's not hard to calculate the cost of the training though. It takes 6 * active parameters * tokens flops[1]. To get number of seconds you can divide by Flops/s * MFU, where MFU is around 45% for H100 for large enough models[2].

[1]: https://arxiv.org/abs/2001.08361

[2]: https://github.com/facebookresearch/lingua

That paper's 5 years old at this point, dating back to when Amodei was still an OpenAI employee. Has any newer work superseded it, or are those assumptions still considered solid?
Those assumptions are still the same. Although now context length has increased more so the n^2 part is non negligible. See the repo for correct flop calculation[1]

[1]: https://github.com/facebookresearch/lingua/blob/437d680e5218...