Hacker News
Monitor and Optimize your large-scale model training (trainy.ai)
13 points by roanakb 1047 days ago
1 comments

This is really cool! When we were trying to launch the GSPMD feature for PyTorch/XLA at Google, one of our biggest bottlenecks was network overhead, but we didn't really have any robust tools to dig into it and perform root-cause analysis. I'm loving the tools I see coming out of Trainy.
Thanks! Let me know if there are any features you'd like to see added.