What metrics and Kubernetes runtime data does Neurox collect to provide its AI workload monitoring dashboards, and how customizable are these dashboards for different user roles like developers or finance auditors?
We collect a handful of metrics, but coming from our previous lives in DevOps, we collect only what's needed, to avoid metrics bloat.
The main three are:
- GPU runtime stats from NVIDIA SMI (nvidia-smi)
- Running pods from Kube state
- Node data & events from Kube state
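To make the first item concrete, here is a minimal sketch of reading GPU runtime stats through nvidia-smi's CSV query mode. It is an illustration only, not Neurox's actual collector: the choice of fields and the helper names `parse_gpu_stats` and `collect_gpu_stats` are assumptions.

```python
import csv
import io
import subprocess

# These are standard nvidia-smi --query-gpu properties; which fields
# Neurox actually reads is an assumption made for this sketch.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_stats(csv_text):
    """Parse nvidia-smi output produced with --format=csv,noheader,nounits."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not rec:
            continue  # skip blank lines
        rows.append({
            "index": int(rec[0]),
            "utilization_pct": float(rec[1]),
            "memory_used_mib": float(rec[2]),
            "memory_total_mib": float(rec[3]),
        })
    return rows


def collect_gpu_stats():
    """Run nvidia-smi on the local node (hypothetical helper; needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)
```

Keeping the parser separate from the subprocess call means it can be exercised on a machine without a GPU.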
Several screens present similar information, each aimed at a different role. For example, the Workloads screen is mainly for researchers, who use it to monitor their workloads from creation to completion, while the Reports screen mainly shows cost data grouped by team, project, etc.
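As a rough illustration of the kind of grouping a cost report performs, the sketch below sums GPU cost per team and project. The record fields (`team`, `project`, `gpu_count`, `hours`), the flat hourly rate, and the function name `cost_by_group` are all hypothetical; this is not how Neurox computes its reports.

```python
from collections import defaultdict


def cost_by_group(workloads, rate_per_gpu_hour):
    """Sum GPU cost per (team, project) pair.

    Each workload is a dict with hypothetical keys:
    team, project, gpu_count, hours.
    """
    totals = defaultdict(float)
    for w in workloads:
        key = (w["team"], w["project"])
        # Cost = GPUs held * hours run * assumed flat rate.
        totals[key] += w["gpu_count"] * w["hours"] * rate_per_gpu_hour
    return dict(totals)


# Example input with made-up teams and durations.
example_workloads = [
    {"team": "nlp", "project": "a", "gpu_count": 2, "hours": 3.0},
    {"team": "nlp", "project": "a", "gpu_count": 1, "hours": 1.0},
    {"team": "cv", "project": "b", "gpu_count": 4, "hours": 0.5},
]
example_totals = cost_by_group(example_workloads, rate_per_gpu_hour=2.0)
```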