Hacker News new | ask | show | jobs
by anonymousDan 360 days ago
What kind of failures are you typically concerned with here?
1 comments

We want to be tolerant to application bugs and host/GPU failures that can be solved by replacing/restarting the machine. External services and network failures we don't have much control over so aren't aiming to solve that.

For specific types of failures check out the section on "Reliability and Operational Challenges" from the Llama 3 paper https://ai.meta.com/research/publications/the-llama-3-herd-o...