by sundalia 665 days ago
Application-specific metrics are the way to go. For ML training this is one example: https://cloud.google.com/blog/products/ai-machine-learning/g...
1 comment

Nice, ML Productivity Goodput looks like a well-thought-out metric for understanding the overall efficiency of a cluster. I'll consider adding it to our cluster management platform. The potential drawbacks I can see are that it is somewhat difficult to compute, since it relies on metrics like MFU (model FLOPs utilization), and that it can't be observed layer by layer to pinpoint inefficient kernels. I'll take a deeper look. Thanks!
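
For anyone wondering what the "difficult to compute" part looks like in practice, here is a rough sketch of the arithmetic. It rests on two assumptions of mine that the thread does not spell out: that MFU is estimated with the common 6 * N FLOPs-per-token approximation for dense transformer training (which ignores attention FLOPs), and that the overall metric is the product of scheduling, runtime, and program goodput fractions, as I understand the linked post. The function and parameter names are hypothetical and do not come from any library.

```python
def estimate_mfu(tokens_per_second: float,
                 n_params: float,
                 n_chips: int,
                 peak_flops_per_chip: float) -> float:
    """Estimate model FLOPs utilization as a fraction in [0, 1].

    Assumption: forward + backward pass costs about 6 * n_params FLOPs
    per token for a dense transformer (attention FLOPs are ignored).
    """
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    peak_flops_per_second = n_chips * peak_flops_per_chip
    return achieved_flops_per_second / peak_flops_per_second


def ml_productivity_goodput(scheduling: float,
                            runtime: float,
                            program: float) -> float:
    """Combine three goodput fractions into one number.

    Assumption: the overall metric is the product of the three terms,
    each expressed as a fraction in [0, 1].
    """
    for name, value in (("scheduling", scheduling),
                        ("runtime", runtime),
                        ("program", program)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} goodput must be in [0, 1], got {value}")
    return scheduling * runtime * program


if __name__ == "__main__":
    # Made-up illustrative numbers, not measurements.
    mfu = estimate_mfu(tokens_per_second=1e5, n_params=1e9,
                       n_chips=8, peak_flops_per_chip=1e14)
    print(f"estimated MFU: {mfu:.2f}")
    print(f"overall goodput: {ml_productivity_goodput(0.9, 0.8, mfu):.2f}")
```

The hard part is not this arithmetic but getting trustworthy inputs: throughput has to be measured per job, and peak FLOPs depends on the chip and numeric precision. That is also why the result is a single cluster-level number and says nothing about individual layers or kernels.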