by badmonster 408 days ago
What metrics and Kubernetes runtime data does Neurox collect to provide its AI workload monitoring dashboards, and how customizable are these dashboards for different user roles like developers or finance auditors?

We collect a handful of metrics, but coming from our previous lives in DevOps, we collect only what's needed, to avoid metrics bloat.

The main three are:

- GPU runtime stats from nvidia-smi

- Running pods from Kube state

- Node data & events from Kube state
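
To make the first item concrete, here is a minimal sketch of reading per-GPU stats through nvidia-smi's CSV query mode. This is an illustration only, not Neurox's actual collector: the `GpuSample` type, the function names, and the choice of fields are assumptions made for the example.

```python
import subprocess
from dataclasses import dataclass

# Fields requested from nvidia-smi; the selection here is an assumption
# made for illustration, not the set Neurox collects.
QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


@dataclass
class GpuSample:
    """One row of nvidia-smi output (hypothetical type for this sketch)."""
    index: int
    utilization_pct: float
    memory_used_mib: float
    memory_total_mib: float


def parse_gpu_csv(text: str) -> list[GpuSample]:
    """Parse output produced with --format=csv,noheader,nounits.

    Assumes every field is numeric; nvidia-smi can report values such as
    "[N/A]" on some hardware, which this sketch does not handle.
    """
    samples = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        idx, util, used, total = [field.strip() for field in line.split(",")]
        samples.append(GpuSample(int(idx), float(util), float(used), float(total)))
    return samples


def collect_gpu_samples() -> list[GpuSample]:
    """Run nvidia-smi on the local node and return one sample per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Keeping the parser separate from the subprocess call means it can be exercised on a machine without a GPU by passing it a captured output string.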

We have several screens with similar information intended for different roles. For example, the Workloads screen is mainly for researchers to monitor their workloads from creation to completion. The Reports screen mainly shows cost data grouped by team, project, etc.
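
As a rough sketch of what grouping cost by team could look like, the snippet below sums GPU-hours per team and multiplies by a flat hourly rate. Everything here is hypothetical: the record fields (`team`, `gpus`, `runtime_hours`), the function name, and the flat-rate pricing model are assumptions for illustration, not how the Reports screen is computed.

```python
from collections import defaultdict


def cost_by_team(workloads, hourly_rate_per_gpu):
    """Sum estimated cost per team.

    `workloads` is a list of dicts with hypothetical keys:
    "team" (str), "gpus" (int), and "runtime_hours" (float).
    """
    totals = defaultdict(float)
    for w in workloads:
        gpu_hours = w["gpus"] * w["runtime_hours"]
        totals[w["team"]] += gpu_hours * hourly_rate_per_gpu
    return dict(totals)


# Example input with made-up teams and numbers.
example = [
    {"team": "vision", "gpus": 2, "runtime_hours": 3.0},
    {"team": "nlp", "gpus": 1, "runtime_hours": 4.0},
    {"team": "vision", "gpus": 1, "runtime_hours": 1.0},
]
report = cost_by_team(example, hourly_rate_per_gpu=2.0)
```

In a real cluster the team would more likely come from a pod label or namespace, and the rate would vary by GPU type.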