Hacker News
Silent Data Corruptions: The Boogeyman of LLM Training (adept.ai)
31 points by jmintz 991 days ago | 5 comments
auraham 991 days ago
Interesting post. It would be much better if the author included a few code snippets to show how to identify the failing GPU during training.
ejro 991 days ago
Interesting. This is probably a universal problem in large-model training, but it isn't being discussed enough.
adeptlo 991 days ago
Super interesting problem that's affecting more people than they probably realize.
osavant 991 days ago
Super interesting. Thanks for putting this together.
ibeitia 991 days ago
Fascinating read!