by laidoffamazon 295 days ago
Interesting. My assumption was that one of the innovations of DeepSeek and the modern GPT models was performing low-precision pretraining rather than just finetuning further. I didn't realize you still need accumulation at a higher precision anyway.
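
For anyone else who was surprised by the accumulation part, here is a toy sketch of why a narrow accumulator is a problem. It is not how DeepSeek or any GPT model does it; it just simulates IEEE half precision in pure Python using the `struct` module's `e` format. The helper names (`to_fp16`, `sum_fp16_accumulator`, `sum_wide_accumulator`) and the input values are made up for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_fp16_accumulator(values):
    """Sum fp16 inputs while also rounding the running total to fp16 each step."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


def sum_wide_accumulator(values):
    """Sum fp16 inputs in a wider accumulator (a Python float, i.e. float64),
    rounding to fp16 only once at the end."""
    acc = 0.0
    for v in values:
        acc += to_fp16(v)
    return to_fp16(acc)


# Hypothetical workload: many small, equal contributions.
values = [0.01] * 100_000

low = sum_fp16_accumulator(values)
wide = sum_wide_accumulator(values)

print("fp16 accumulator:", low)
print("wide accumulator:", wide)
```

The true sum is about 1000. The fp16 accumulator gets stuck far short of that: once the running total reaches 32, neighbouring fp16 values are 2^-5 (about 0.031) apart, so adding 0.01 rounds straight back to the same number. The wide accumulator keeps the small increments and only loses precision in the single final rounding.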