by omerhac 720 days ago
This is such a valuable piece. I've learned so much reading it! And your open-source code is great as well.

Some open questions I have: 1) Why did you choose to set up your own cluster? How was the experience with your cloud partner regarding faulty machines / switches? 2) Which of your considerations in choosing the cluster architecture have proven the most valuable (apart from the all2all comms)? 3) Can you share a bit more about your logging infra, apart from the fact that it was Loki-based? 4) What necessitated the use of a local Docker registry? Did you use other images apart from nvidia-container-runtime?

Thanks!