by d4l3k
353 days ago
We want to tolerate application bugs and host/GPU failures that can be solved by replacing or restarting the machine. We don't have much control over external services and network failures, so we aren't aiming to solve those. For specific types of failures, check out the "Reliability and Operational Challenges" section of the Llama 3 paper: https://ai.meta.com/research/publications/the-llama-3-herd-o...