by laidoffamazon
295 days ago
Interesting. My assumption was that one of the innovations of DeepSeek and the modern GPT models was doing the pretraining itself in low precision, rather than just fine-tuning further in low precision. I didn't realize you still need to accumulate at a higher precision anyway.
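For anyone else surprised by the accumulation point, here is a rough sketch of why it matters. It is not how DeepSeek or any GPT model actually does it: fp16 stands in for "the low-precision format", fp32 for "the higher-precision accumulator", and the helper names (`to_fp16`, `to_fp32`, `accumulate`) are made up for this illustration. The inputs are identical low-precision values in both runs; only the precision of the running sum changes.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, round_fn):
    """Sum values, rounding the running total to the accumulator precision
    after every addition (hypothetical helper for this illustration)."""
    total = 0.0
    for v in values:
        total = round_fn(total + v)
    return total


# Assumption: 4096 identical small terms, already stored in fp16,
# standing in for the products inside one long dot product.
N = 4096
values = [to_fp16(0.01)] * N

reference = math.fsum(values)            # accurate float64 reference sum
sum_fp16 = accumulate(values, to_fp16)   # low-precision accumulator
sum_fp32 = accumulate(values, to_fp32)   # higher-precision accumulator

print(f"reference        : {reference:.4f}")
print(f"fp16 accumulator : {sum_fp16:.4f}")
print(f"fp32 accumulator : {sum_fp32:.4f}")
```

Once the fp16 running sum gets large enough, each new term is smaller than half the spacing between representable values and gets rounded away, so the total stops growing. The fp32 accumulator stays close to the reference even though every input is still fp16.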