by juancn
501 days ago
The compute scheduling part of the paper is also very good, particularly the way they balance load to keep compute and communication in check. A lot of thought also went into all the tiny optimizations that reduce memory usage, and into using FP8 effectively without significant loss of precision or dynamic range. None of the techniques is really mind-blowing by itself, but the whole is very well done. The DeepSeek-V3 paper is a really good read: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSee...
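To make the FP8 dynamic range point a bit more concrete, here is a toy sketch of my own, not code from the paper. It simulates rounding to the E4M3 format (3 mantissa bits, max value 448, smallest normal exponent -6) in pure Python and compares one scale for the whole tensor against one scale per small block, which is roughly the fine-grained scaling idea as I understand it. The function names, the block size of 4, and the sample values are all assumptions picked for illustration.

```python
import math

# E4M3 format constants; everything else in this sketch is illustrative.
FP8_E4M3_MAX = 448.0   # largest finite E4M3 value
MIN_NORMAL_EXP = -6    # smallest normal exponent; below this, steps stop shrinking
MANTISSA_BITS = 3


def round_to_e4m3(x):
    """Round a float to the nearest value on a simulated E4M3 grid."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)  # saturate instead of overflowing
    exp = max(math.floor(math.log2(mag)), MIN_NORMAL_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)  # spacing between representable values
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_dequantize(values, block_size):
    """Scale each block so its largest magnitude maps to the FP8 max,
    round to the simulated grid, then scale back."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = FP8_E4M3_MAX / amax if amax > 0 else 1.0
        out.extend(round_to_e4m3(v * scale) / scale for v in block)
    return out


# Hypothetical data: a block of tiny values next to a block of large ones.
values = [1e-4, -2e-4, 3e-4, 1.5e-4, 100.0, -250.0, 400.0, 50.0]

coarse = quantize_dequantize(values, block_size=len(values))  # one scale overall
fine = quantize_dequantize(values, block_size=4)              # one scale per block
```

With a single scale, the large values set the scale and the tiny ones underflow to zero. With a scale per block, every value stays within the rounding error you would expect from 3 mantissa bits. The price is storing one extra scale factor per block.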