Hacker News
by the_evacuator 3190 days ago
Prometheus is an escaped implementation of Google’s borgmon, which is seen inside Google as a kind of horror show, and alternatives have been developed. It is kind of frightening that it has got out in the wild and people like it.
8 comments

"Borgmon is the worst form of monitoring, except for all those other forms that have been tried from time to time."

-- Winston Churchill if he worked at Google.

Having worked both on borgmon and with borgmon, I can think of a few reasons why some (many?) don't like it all that much:

1. As already mentioned, the macro system (and the fact that its use is basically required to set up basic monitoring) has quite a steep learning curve. Prometheus doesn't have that and I personally would prefer it did.

2. It's not a service, so you have to set up your own instance, configure it, and maintain it. In many engineers' minds this is just another hurdle between them and launching their service.

3. As a software engineer (particularly new to Google) you might not expect to have to do ops work and carry a pager yourself.

Out of all three, only (1) is a valid reason to hate on borgmon. Both the macro system and the language itself, which is almost a 1:1 match with Prometheus, are very different from a regular programming language. But given the choice between flat, simple metrics (which is what most monitoring systems give you) and the ability to have arbitrary dimensions and work with them to build useful alerts and dashboards and troubleshoot quickly, I (again, personal opinion) will always go with the latter.
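
To make the flat-versus-dimensional distinction concrete, here is a minimal Python sketch. It is not Borgmon or Prometheus code; the metric name, the label names, the sample values, and the `sum_by` helper are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Flat metrics: every combination is baked into the metric name,
# so any regrouping means parsing strings.
flat = {
    "http_errors.web1.us_east.500": 3,
    "http_errors.web2.us_east.500": 1,
    "http_errors.web3.eu_west.503": 4,
}

# Dimensional metrics: one metric plus arbitrary labels per sample.
# (Hypothetical label names and values.)
samples = [
    ({"host": "web1", "zone": "us_east", "code": "500"}, 3),
    ({"host": "web2", "zone": "us_east", "code": "500"}, 1),
    ({"host": "web3", "zone": "eu_west", "code": "503"}, 4),
]


def sum_by(samples, *keys):
    """Sum sample values grouped by the given label names (hypothetical helper)."""
    totals = defaultdict(int)
    for labels, value in samples:
        totals[tuple(labels[k] for k in keys)] += value
    return dict(totals)


# The same data can be sliced along any dimension without renaming metrics.
errors_by_zone = sum_by(samples, "zone")
errors_by_code = sum_by(samples, "code")
print(errors_by_zone)
print(errors_by_code)
```

With the flat names, the per-zone and per-code views would each need their own pre-aggregated metric or some string parsing; with labels, both fall out of the same samples.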

Why is borgmon considered a horror show? Is something fundamentally flawed in the model? How do the alternatives differ from the original borgmon?
Google has an alternative that they gave a talk on back in December. Sadly there aren't any papers on it yet. It's called Monarch and it's what backs Stackdriver.

Its config language is less crazy (Python based) and it operates globally.

https://www.youtube.com/watch?v=LlvJdK1xsl4

Edit: Monarch config isn't sane; it's just different, and at least it's not written in the crazy languages that borgmon uses.

I will point out that borgmon's language (minus macros) is almost a 1:1 match with Prometheus. You can judge for yourself how crazy that is, but I feel that it's close to as simple as you can get for the power it gives you.

As for Monarch, it's a very different beast. For one, it stores all its rules in a protocol buffer format, so it's more structured. But then you have to write Python code that generates the protocol buffers and pushes them to storage. It looks similar to, but is not the same as, the ad-hoc query language. I wouldn't go as far as calling it sane.

It is also a service, and it's optimized for Google's network architecture, with datacenter-local and global nodes. The language itself is aware of this distinction: some computations are done locally, others globally, and so on.
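
A rough Python sketch of that local-then-global split. This is not Monarch's API or config language; the datacenter names, latency values, and both functions are made up for illustration.

```python
# Hypothetical per-datacenter request latencies in milliseconds.
raw = {
    "dc_a": [120, 80, 100],
    "dc_b": [200, 220],
}


def local_aggregate(values):
    """Runs next to the data: reduce raw samples to a small summary."""
    return {"count": len(values), "total": sum(values)}


def global_aggregate(summaries):
    """Runs once, centrally: combine the per-datacenter summaries into a mean."""
    count = sum(s["count"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return total / count


# Only the small summaries cross datacenter boundaries, not the raw samples.
summaries = [local_aggregate(values) for values in raw.values()]
global_mean = global_aggregate(summaries)
print(global_mean)
```

The point of the split is that the expensive part (touching every sample) stays local, and the global step only sees one small record per datacenter.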

For your local monitoring needs (or even global ones, if you're willing to put in the effort), Prometheus is a solid choice.

I'd agree with you after thinking about it some. I haven't really written either; I've mainly copy/pasted or used tools to assist in creation. So I can't really judge either monitoring language on its ease of use.
"sane" is an interesting choice of words to describe Monarch configuration...
Borgmon's language is weird and crusty. People can debate whether this is the main problem with Borgmon or whether more fundamental changes are necessary.

I won't weigh in on that debate. But you can think of Prometheus as an experiment to decide the issue: it is very similar to Borgmon, but has a cleaner language.

And what would be a better alternative available outside of Google?
The decade-old, open-source, self-hosted, debug-it-yourself standard for monitoring is collectd+graphite+grafana.

The equivalent SaaS standard, easy to set up and use, with ALL the features working out of the box, is https://www.datadoghq.com/ or potentially Google Stackdriver if you are on Google Cloud.

Plugging https://www.librato.com/ after reading an HN thread yesterday that DataDog pricing is insane [1].

[1] https://news.ycombinator.com/item?id=15315028

Disclaimer: No relation to either org.

The thread you are linking to is a complete joke. The guy didn't see that the pricing was per host, even though it's written in big letters. His whole series of rants is ridiculous.
Stackdriver is a Google Cloud service; you need a Google account with billing to use it. It can gather metrics from hundreds of vendors, including AWS and Azure.
It wasn't universally liked at Cloudflare. The federation component in particular is a PITA.
> alternatives have been developed

Such as Prometheus :) Even some teams in Google use it.

So do you have any reason not to like it?
Not really. I just think it’s funny to see it described by one group as a reasonable or even state-of-the-art system, while another group describes it as brain damage from ten years ago.
Borgmon being "brain damage" or some kind of "horror show" is more a meme than a serious opinion held by people who have used Borgmon.

Personally I'm very happy that the open source world is adopting something derived from Borgmon rather than something derived from its supposed "replacement".

Yup, I've had more than one current Google SRE state, "You can have my Borgmon when you pry it from my cold dead hands."

Borgmon may be dead in the eyes of some people, but I know for a fact that it's still the only thing monitoring core and critical systems.

Most of the problem with Borgmon, IMO, is the cruft that has built up over the decade+, and neglect due to the Google pattern of "The new thing that doesn't work, and the old thing that is deprecated."

The difficulty at Google is that developers are rewarded for writing something new and shiny from scratch, rather than fixing the old but working systems.

This isn't always a problem, as some good things can come out of starting from scratch. But sometimes they throw out too many of the good ideas, in an attempt to be fancy and new.

I have heard unflattering things about it from a few different ex-Google SREs, specifically about the macro system and it being cumbersome to use.
You will note that Prometheus explicitly does not have a macro system.
Oh sure, I was responding to the OP stating that criticism of Borgmon was more a meme than reality.

It wasn't meant to be commentary on Prometheus (which I quite like) at all :)

It's pretty normal when you consider that Google is 10 years ahead of almost every other company when it comes to infrastructure.
> is 10 years ahead of almost every other company when it comes to infrastructure.

For Google-scale orgs or infrastructure needs. Most everyone else in the world does not need Google scale tools.

Google's needs are extremely common. Take a look at any Fortune 500 company and it could usually benefit greatly from a lot of the infrastructure that powers Google.

Most of them do run their own datacenters, sometimes in numerous locations, and they have massive and extremely complex IT systems in place.