by dijit
38 days ago
"Getting it running" is the easy part. "Getting it ready for production" is a different game. I've fallen on my sword many times trying to explain that Prometheus fails every metric of production readiness; in fact, Google themselves replaced Borgmon (which Prometheus is modeled on) with Monarch because "tiny unreliable time series databases everywhere" was, in fact, not the successful and reliable deployment strategy they had claimed. But it is very easy to set up. Just don't go looking for failure modes, because they're everywhere and every single one of them is catastrophic.
See this PR, for example (https://github.com/prometheus/prometheus/pull/18364); this used to impact a production deployment I worked on. Prometheus, Thanos, and even OpenTelemetry are full of these kinds of problems, but at the same time they're the best we have, and we should be grateful they're free and open source.
I'd still choose an open source stack (and contribute to it) rather than go for a proprietary solution; we've all seen what happens with DataDog & co.
Please don't take my words lightly: I worked with the rest of my team on a large-scale observability platform, and scalability should not be underestimated. At the same time, DataDog / Splunk prices are simply unjustified. Ironically, it's cheaper to build a team of engineers who will maintain a sane observability stack than to keep feeding the monster(s).