Bruno from Inngest here, thanks for asking!

In general, we use OpenTelemetry[1] to instrument our services in production, collecting metrics and logs for important events. Specifically, we have set up:

- multiple dashboards informing us about current system usage (events received, processed) including e2e latency distributions, compute resource usage for different deployments, and top operations

- metrics on critical systems (data stores including Redis, messaging infrastructure, connection poolers for Postgres, etc.) to gauge current resource utilization and typical load patterns

- alerting on unexpected deviations in KPIs (a subset of the metrics above) to help us spot and react to issues quickly

- forecasting of product usage and compute resource utilization patterns for planning medium- to long-term infrastructure work
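
To make the alerting bullet a bit more concrete, here is a minimal sketch of one common way to flag an unexpected deviation in a KPI: compare each new sample against a rolling baseline and alert when it is more than a few standard deviations away. This is only an illustration and not a description of Inngest's actual setup. The `DeviationAlert` class, the window size, the threshold, and the latency samples are all hypothetical, and it uses only the Python standard library rather than any OpenTelemetry API.

```python
from collections import deque
from statistics import mean, stdev


class DeviationAlert:
    """Hypothetical helper: flags a KPI sample that strays from its recent history."""

    def __init__(self, window=30, threshold=3.0, min_samples=5):
        self.history = deque(maxlen=window)  # rolling window of recent samples
        self.threshold = threshold           # allowed distance, in standard deviations
        self.min_samples = min_samples       # stay quiet until a baseline exists

    def observe(self, value):
        """Record a sample; return True if it deviates from the rolling baseline."""
        alert = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            spread = stdev(self.history)
            if spread == 0:
                # A perfectly flat history: any change counts as a deviation.
                alert = value != baseline
            else:
                alert = abs(value - baseline) / spread > self.threshold
        self.history.append(value)
        return alert


# Made-up e2e latency samples in milliseconds; the last one is a spike.
detector = DeviationAlert(window=30, threshold=3.0)
samples = [120, 118, 125, 122, 119, 121, 480]
flags = [detector.observe(s) for s in samples]
print(flags)
```

Note that in this sketch an anomalous sample still enters the rolling window, so a sustained shift eventually becomes the new baseline. Whether that is desirable depends on the KPI.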

Hope this helps!

[1]: https://opentelemetry.io/