by Millenis
3701 days ago
Hi, honestly this feels like the conversation our team had a few years ago about Nagios. Some people were happy with pockets of monitoring servers dotted around, with wiki pages full of links to get to them. It took a long time to clean that up and finally end up with a single central service.
The outbound vs inbound firewall comparison is totally fair. At almost every company I have worked for, the perimeter security blocks all incoming ports. Most places have no outbound port restrictions, or when they do they usually have a proxy for traffic to go out (like HTTPS for updates etc). This is what makes the design of outbound-only connections considerably better (Sensu, Datadog, insert pretty much any SaaS vendor). I honestly don't know of any company that would open up the extreme number of ports required to allow scraping from an external monitoring tool. In a large enterprise you want to host the monitoring tool as a service for other groups, which potentially means in a totally different data center or cloud, and allowing a small list of subnets 'in' is considerably better than exposing every single server to external access.
For the scaling I'm not sure I totally agree. There is a reason why distributed systems exist and that is to scale efficiently with some degree of redundancy baked in. Single node HA is probably the most inefficient method of scaling. It sounds like Prometheus needs to run properly on a single box for simplicity but over time needs to be broken up and made scalable beyond the bounds of a single server. Being limited to a single server in 2016, like Nagios was 10 years ago before they too started to split some things out, isn't something I'd advertise as a feature.
I think right now Prometheus is probably filling the gap of what an industrial historian would do in a factory. It's a console you would have sitting next to the thing you were building that would provide real time measurements. We didn't want lots of individual consoles. We wanted a central large system that could hold years of historical data, as well as allowing people to query it in real time, and we liked the idea of not needing to double up on hardware costs. We're using a mix of Elasticsearch and Cassandra here to achieve that.
|
Depends which company. For companies that really care about security, letting arbitrary traffic out is a big no-no. Even via proxy, it requires tight controls.
We've had potential users that were quite excited that Prometheus works the other way, as their network security team were likely to permit it.
> There is a reason why distributed systems exist and that is to scale efficiently with some degree of redundancy baked in.
The efficiency here is in humans (in theory anyway), not in resources or reliability. Distributed systems are a very hard problem, and we avoid those approaches for Prometheus as it's a critical monitoring system. CP systems like Kafka and Zookeeper are not things you want on your alerting path, as they'll fall apart when your network does. Prometheus will keep chugging along.
> Single node HA is probably the most inefficient method of scaling.
I'd disagree. The standard approach these days tends to be a cluster of three, which'd use at least 50% more resources than a redundant pair (three machines rather than two).
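To spell out the arithmetic behind that figure, here is a minimal sketch. It assumes the comparison is between a redundant pair of identical servers and a three-node cluster where every node is sized like one of the pair; the function name is made up for illustration.

```python
def extra_resources(baseline_nodes: int, cluster_nodes: int) -> float:
    """Fraction of extra machines the cluster needs relative to the baseline."""
    return (cluster_nodes - baseline_nodes) / baseline_nodes


# Assumption: an HA pair is 2 machines, a typical quorum cluster is 3.
print(f"{extra_resources(2, 3):.0%} more machines")
```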
> It sounds like Prometheus needs to run properly on a single box for simplicity but over time needs to be broken up and made scalable beyond the bounds of a single server.
That's correct. If you manage to have enough targets inside a single datacenter (many thousands of machines), then we recommend vertical sharding first, and horizontal sharding only if that doesn't work. Prometheus is really easy to run, so you'll likely end up choosing to shard vertically anyway for organisational reasons, so that each team can control their own decentralised monitoring.
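As a rough illustration of the two kinds of sharding, here is a sketch in Python. It is not how Prometheus is configured; the target dictionaries, the `team` field, and both function names are hypothetical, and hashing the target address is just one assumed way of splitting a single job across servers.

```python
import hashlib
from collections import defaultdict


def vertical_shards(targets):
    """Vertical sharding: each team gets its own monitoring server."""
    shards = defaultdict(list)
    for target in targets:
        shards[target["team"]].append(target["address"])
    return dict(shards)


def horizontal_shard(address: str, num_shards: int) -> int:
    """Horizontal sharding: spread similar targets over N servers by a stable hash."""
    digest = hashlib.sha256(address.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


targets = [
    {"team": "web", "address": "web-1:9100"},
    {"team": "web", "address": "web-2:9100"},
    {"team": "db", "address": "db-1:9100"},
]

# One server per team, each scraping only that team's targets.
print(vertical_shards(targets))
# Which of 2 servers would scrape each target if one team's load outgrew one box.
print({t["address"]: horizontal_shard(t["address"], 2) for t in targets})
```

The vertical split keeps every query for a team on one server, which is why it is the first thing to reach for; the horizontal split means queries over the whole job have to be combined across servers.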
> We didn't want lots of individual consoles.
With Grafana you can view things across many Prometheus servers.
> We wanted a central large system that could hold years of historical data,
Prometheus explicitly doesn't do historical data. We're effectively talking about unbounded amounts of storage that can't fit on one machine, which implies a distributed storage system. As mentioned above, that's not something we want in our core, for reliability reasons.
The chosen approach is that we'll interface with something else such as OpenTSDB that'll do the long term storage, and we'll support seamlessly graphing across it. That's much easier to make reliable (just add a timeout), and if it does go down you'd still have the last few weeks of data sitting on the Prometheus box.
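The "just add a timeout" idea can be sketched like this. It is an illustration only, not Prometheus code: `query_with_fallback` and the fake query functions are hypothetical, and points are assumed to be `(timestamp, value)` tuples.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def query_with_fallback(remote_query, local_query, timeout_s):
    """Merge long-term and local points; use local data alone if remote is slow or down."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(remote_query)
    try:
        old_points = future.result(timeout=timeout_s)
    except Exception:
        # Timeout or remote failure: graph what the local server still has.
        old_points = []
    finally:
        pool.shutdown(wait=False)  # do not block on a hung remote call
    return sorted(old_points + local_query())


def local_recent():
    """Stand-in for the last few weeks of data on the local box."""
    return [(1000, 0.5), (1015, 0.7)]


def healthy_long_term():
    """Stand-in for a responsive long-term store."""
    return [(10, 0.1)]


def slow_long_term():
    """Stand-in for a long-term store that is not answering in time."""
    time.sleep(0.5)
    return [(10, 0.1)]


print(query_with_fallback(healthy_long_term, local_recent, timeout_s=1.0))
print(query_with_fallback(slow_long_term, local_recent, timeout_s=0.05))
```

The point of the design is that a failure in the long-term store degrades the graph to recent data rather than taking the query, or the alerting path, down with it.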
> as well as allowing people to query it in real time
That we do really well. PromQL is very powerful, and anything you can graph you can also alert on.
> and we liked the idea of not needing to double up on hardware costs.
I feel this is a bit of a red herring. The question isn't whether there's a 2X multiplier in the math, it's the overall cost as compared to the benefits.
Prometheus is astoundingly efficient: the latest numbers are 800k samples/s on one machine. I haven't heard of anything else that is even close, and I believe we're also holding the record on storage efficiency.
Even 10X that cost isn't likely to break the bank, so I'd suggest taking a look at the full range of features it offers and comparing to the real world cost. The operational aspects you mention are generally manageable, and if they aren't your infrastructure likely has bigger problems.
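For a back-of-the-envelope version of that comparison, here is a small sketch. Only the 800k samples/s figure comes from the numbers above; the fleet size, series per host, scrape interval, and the function itself are made-up assumptions to plug your own values into.

```python
import math

SAMPLES_PER_SEC_PER_SERVER = 800_000  # ingestion figure quoted above


def servers_needed(hosts, series_per_host, scrape_interval_s, replicas=2):
    """Return (ingest rate in samples/s, total servers including replicas)."""
    rate = hosts * series_per_host / scrape_interval_s
    per_replica = max(1, math.ceil(rate / SAMPLES_PER_SEC_PER_SERVER))
    return rate, per_replica * replicas


# Hypothetical workload: 10,000 hosts, 1,000 series each, scraped every 15s.
rate, servers = servers_needed(hosts=10_000, series_per_host=1_000, scrape_interval_s=15)
print(f"{rate:,.0f} samples/s -> {servers} servers including redundancy")
```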