In general, we use OpenTelemetry[1] to instrument our services in production, collecting metrics and logs for important events. Specifically, we have set up:
- multiple dashboards informing us about current system usage (events received, processed) including e2e latency distributions, compute resource usage for different deployments, and top operations
- metrics on critical systems (data stores including Redis, messaging infrastructure, connection poolers for Postgres, etc.) to gauge current resource utilization and typical load patterns
- alerting on unexpected deviations in KPIs (a subset of the metrics above) to help us spot and react to issues quickly
- forecasting of product usage and compute resource utilization patterns for planning medium- to long-term infrastructure work
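
To make the alerting point a bit more concrete, here is a minimal sketch of flagging a KPI sample that deviates from a rolling baseline. It assumes a simple z-score check; the `DeviationAlert` class, the window size, the threshold, and the sample latencies are all hypothetical and chosen for illustration only. In practice this kind of rule would more likely live in the monitoring backend than in application code.

```python
from collections import deque
from statistics import mean, stdev


class DeviationAlert:
    """Flags KPI samples that stray too far from a rolling baseline.

    Hypothetical helper for illustration; not part of OpenTelemetry.
    """

    def __init__(self, window=30, threshold=3.0, min_samples=5):
        self.samples = deque(maxlen=window)  # most recent KPI samples
        self.threshold = threshold           # allowed distance in standard deviations
        self.min_samples = min_samples       # wait for a baseline before alerting

    def observe(self, value):
        """Return True if `value` deviates from the baseline, then record it."""
        alert = False
        if len(self.samples) >= self.min_samples:
            baseline = mean(self.samples)
            spread = stdev(self.samples)
            if spread > 0:
                alert = abs(value - baseline) / spread > self.threshold
            else:
                # Perfectly flat baseline: any change counts as a deviation.
                alert = value != baseline
        self.samples.append(value)
        return alert


# Made-up e2e latency samples in milliseconds; the last one is a spike.
detector = DeviationAlert(window=30, threshold=3.0)
latencies_ms = [120, 118, 125, 122, 119, 121, 123, 480]
alerts = [detector.observe(v) for v in latencies_ms]
```

With these made-up samples only the final spike is flagged. Note that the spike is still appended to the window, so a sustained shift would gradually become the new baseline; whether that is desirable depends on the KPI.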
Hope this helps!
[1]: https://opentelemetry.io/