by omerhac
720 days ago
This is such a valuable piece.
I've learned so much reading it! And your open-source code is great as well. Some open questions I have:
1) Why did you choose to set up your own cluster? How was the experience with your cloud partner regarding faulty machines / switches?
2) Which of your considerations in choosing the cluster architecture proved the most valuable (apart from the all2all comms)?
3) Can you share a bit more about your logging infra, apart from the fact that it was Loki-based?
4) What necessitated the use of a local Docker registry? Did you use other images apart from nvidia-container-runtime? Thanks!