Y
Hacker News
new
|
ask
|
show
|
jobs
Instrumentation checklist for running large GPU clusters
(
together.ai
)
2 points
by
amaldavid
673 days ago
1 comments
amaldavid
673 days ago
Just stumbled upon this blog which details out testing and validating large GPU clusters before running training workloads. Any other similar blogs which adds more nuance in terms of debugging the issues once we identify them as well?
link
roanakb
673 days ago
this is a good one for debugging rdma:
https://docs.redhat.com/en/documentation/red_hat_enterprise_...
link