Hacker News new | ask | show | jobs
by to11mtm 529 days ago
> I’ve been pushing the use of Datadog for years but their pricing is out of control for anyone between mid size company and large enterprises

Not a fan of datadog vs just good metric collection. OTOH while I see the value of OTEL vs what I prefer to do... in theory.

My biggest problem with all of the APM vendors, once you have kernel hooks via your magical agent all sorts of fun things come up that developers can't explain.

My favorite example: At another shop we eventually adopted Dynatrace. Thankfully our app already had enough built-in metrics that a lead SRE considered it a 'model' for how to do instrumentation... I say that because, as soon as Dynatrace agents got installed on the app hosts, we started having various 'heisenbugs' requiring node restarts as well as a directly measured drop in performance. [0]

Ironically, the metrics saved us from grief, yet nobody had an idea how to fix it. ;_;

[0] - Curiously, the 'worst' one was MSSQL failovers on update somehow polluting our ADO.NET connection pools in a bad way...

1 comments

> I say that because, as soon as Dynatrace agents got installed on the app hosts, we started having various 'heisenbugs' requiring node restarts

Our containers regularly fail due vague LD_PRELOAD errors. Nobody has invested the time to figure out what the issue is because it usually goes away after restarting; the issue is intermittent and non-blocking, yet constant.

It's miserable.

We do at least one rolling restart a day because it’s the best way to GC. And we’re not using any APM yet