
by bbrazil 3705 days ago

> Not only that, only scaling vertically on a single node doesn't seem like a good design.

For Prometheus at least, we're so efficient that it actually works out okay for the vast majority of users. You'd typically need thousands of instances doing the same thing inside a single datacenter before you get into our (admittedly more involved) horizontal sharding approach.

http://www.robustperception.io/scaling-and-federating-promet... has more information.

> There are ways to poll things and push metrics without opening millions of firewall ports to every security group. Sensu does that quite well, and it scales.

I don't think that's quite a fair comparison. Sensu Just Works when there's no outbound firewall, Prometheus Just Works when there's no inbound firewall. If you add the other direction of firewall for either then things break down.


Millenis 3705 days ago

Hi, honestly this feels like the conversation our team had a few years ago about Nagios. Some people were happy with pockets of monitoring servers dotted around with wiki pages full of links to get to them. It took a long time to clean that up and finally end up with a single central service.

The outbound vs inbound firewalls is totally a fair comparison. At almost every company I have worked for, the perimeter security blocks all incoming ports. Most places have no outbound port restrictions, or when they do they usually have a proxy for traffic to go out (like HTTPS for updates etc). This is what makes the design of outbound-only connections considerably better (Sensu, Datadog, insert pretty much any SaaS vendor). I honestly don't know of any company that would open up the extreme number of ports required to allow scraping from an external monitoring tool. In a large enterprise you want to host the monitoring tool as a service for other groups, which means potentially in a totally different data center or cloud, and allowing a small list of subnets 'in' is considerably better than exposing every single server to external access.

For the scaling I'm not sure I totally agree. There is a reason why distributed systems exist and that is to scale efficiently with some degree of redundancy baked in. Single node HA is probably the most inefficient method of scaling. It sounds like Prometheus needs to run properly on a single box for simplicity but over time needs to be broken up and made scalable beyond the bounds of a single server. Being limited to a single server in 2016, like Nagios was 10 years ago before they too started to split some things out, isn't something I'd advertise as a feature.

I think right now Prometheus is probably filling the gap of what an industrial historian would do in a factory. It's a console you would have sitting next to the thing you were building that would provide real time measurements. We didn't want lots of individual consoles. We wanted a central large system that could hold years of historical data, as well as allowing people to query it in real time, and we liked the idea of not needing to double up on hardware costs. We're using a mix of elasticsearch and cassandra here to achieve that.


bbrazil 3704 days ago

> Most places have no outbound port restrictions, or when they do they usually have a proxy for traffic to go out (like HTTPS for updates etc)

Depends which company. For companies that really care about security, letting arbitrary traffic out is a big no-no. Even via proxy, it requires tight controls.

We've had potential users that were quite excited that Prometheus works the other way, as their network security team were likely to permit it.

> There is a reason why distributed systems exist and that is to scale efficiently with some degree of redundancy baked in.

The efficiency here is in humans (in theory anyway), not in resources or reliability. Distributed systems are a very hard problem, and we avoid those approaches for Prometheus as it's a critical monitoring system. CP systems like Kafka and Zookeeper are not things you want on your alerting path, as they'll fall apart when your network does. Prometheus will keep chugging along.

> Single node HA is probably the most inefficient method of scaling.

I'd disagree, the standard approach these days tends to be a cluster of three which'd use at least 50% more resources.

> It sounds like Prometheus needs to run properly on a single box for simplicity but over time needs to be broken up and made scalable beyond the bounds of a single server.

That's correct. If you manage to have enough targets inside a single datacenter (many thousands of machines), then we recommend vertical sharding first, and horizontal sharding only if that doesn't work. Prometheus is really easy to run, so you'll likely end up choosing to shard vertically anyway for organisational reasons, so that each team can control their own decentralised monitoring.
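
To make the horizontal case concrete, here is a rough sketch of the idea, not Prometheus code: when one service has too many targets for one server, each server scrapes only the targets whose hash lands in its bucket. The function names, the target addresses, and the choice of SHA-256 are all assumptions made for illustration; the linked article describes the actual mechanism.

```python
import hashlib


def shard_for(target: str, num_shards: int) -> int:
    """Map a target address to a shard index in [0, num_shards)."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    # Use the last 8 bytes of the digest as an integer, then take the modulus.
    return int.from_bytes(digest[-8:], "big") % num_shards


def targets_for_shard(targets, shard, num_shards):
    """Return the targets that the given shard is responsible for scraping."""
    return [t for t in targets if shard_for(t, num_shards) == shard]


# Hypothetical target list: 1,000 instances of the same service.
targets = [f"10.0.{i // 250}.{i % 250}:9100" for i in range(1000)]

# Three servers, each keeping only its own slice of the targets.
shards = [targets_for_shard(targets, s, 3) for s in range(3)]
print([len(s) for s in shards])
```

Every target lands on exactly one server, and the assignment is stable as long as the number of shards stays the same. Vertical sharding needs nothing like this, since there the split is simply by service or team.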

> We didn't want lots of individual consoles.

With Grafana you can view things across many Prometheus servers.

> We wanted a central large system that could hold years of historical data,

Prometheus explicitly doesn't do historical data. We're effectively talking about unbounded amounts of storage that can't fit on one machine, which implies a distributed storage system. As mentioned above, that's not something we want in our core, for reliability.

The chosen approach is that we'll interface with something else such as OpenTSDB that'll do the long term storage, and we'll support seamlessly graphing across it. That's much easier to make reliable (just add a timeout), and if it does go down you'd still have the last few weeks of data sitting on the Prometheus box.
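
A minimal sketch of what "just add a timeout" could look like, assuming two hypothetical callables: `query_long_term` for the external store and `query_local` for the data held on the Prometheus box. Neither is a real Prometheus or OpenTSDB API; the point is only that a slow or failed long-term store degrades to "recent data only" rather than to an error.

```python
import concurrent.futures


def query_with_fallback(query_long_term, query_local, timeout_s=2.0):
    """Return long-term plus local points; fall back to local only on trouble."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(query_long_term)  # remote call runs in the background
    local = query_local()                  # local data is always available
    try:
        remote = future.result(timeout=timeout_s)
    except Exception:
        # Timed out or the remote store raised: serve recent data only.
        remote = []
    finally:
        # Do not block on a slow remote call; let it finish in the background.
        pool.shutdown(wait=False)
    return remote + local
```

If the long-term store is down, the caller still gets whatever the local server holds, which is the behaviour described above.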

> as well as allowing people to query it in real time

That we do really well. PromQL is very powerful, and anything you can graph you can also alert on.

> and we liked the idea of not needing to double up on hardware costs.

I feel this is a bit of a red herring. The question isn't whether there's a 2X multiplier in the math, it's the overall cost as compared to the benefits.

Prometheus is astoundingly efficient, the latest numbers are 800k samples/s on one machine. I haven't heard of anything else that is even close, and I believe we're also holding the record on storage efficiency.

Even 10X that cost isn't likely to break the bank, so I'd suggest taking a look at the full range of features it offers and comparing to the real world cost. The operational aspects you mention are generally manageable, and if they aren't your infrastructure likely has bigger problems.


64bitter 3704 days ago

> It sounds like Prometheus needs to run properly on a single box for simplicity but over time needs to be broken up and made scalable beyond the bounds of a single server.

I can understand the reasoning behind them building for a single node. Building distributed systems is hard, not everyone is capable of building these systems. They also require languages and frameworks suited to working in clusters. Golang might not scale too well, but it's great for the simple things.

> The chosen approach is that we'll interface with something else such as OpenTSDB that'll do the long term storage, and we'll support seamlessly graphing across it. That's much easier to make reliable (just add a timeout), and if it does go down you'd still have the last few weeks of data sitting on the Prometheus box.

Doesn't OpenTSDB require ZooKeeper? So if your "network falls apart" you have no historical data? I guess that's fine if you're not alerting on data trends, and also have 10 minutes for your graphs to render.


bbrazil 3704 days ago

> I can understand the reasoning behind them building for a single node. Building distributed systems is hard, not everyone is capable of building these systems. They also require languages and frameworks suited to working in clusters. Golang might not scale too well, but it's great for the simple things.

I'd say we're up to it, and Go is a great language for this class of system. However, we're wise enough to know that even with the best people and tools this would take years to get production ready.

> So if your "network falls apart" you have no historical data? I guess that's fine if you're not alerting on data trends, and also have 10 minutes for your graphs to render.

Yes, you'd lose access to historical data (typically more than a few weeks ago). But everything else works, which should include the vast majority of your alerts.


Millenis 3704 days ago

The point on the ingress vs egress is that most systems already have a route out. To create a route in takes much more effort, especially when you have NATs etc. It's very nice to be able to spin up nodes and not have to care about opening firewall ports in security groups. By default AWS limits inbound and has no limitations on outbound (in classic EC2; things can be changed in VPC). Managing that security list centrally is far more auditable. I'm very surprised you've encountered anyone who thinks it's a great idea to open a port to every security group from a certain location (or many).

Like it or not, you are beginning to split out Prometheus into a set of services regardless of the underlying belief that it all needs to fit onto a single node. That will only become more obvious over time as devices and metrics increase (which they very obviously will with containers and application metrics). There is a limit to a node, 800k metrics per box is not that huge when you consider the things that could be measured just on a single host. We have several thousand metrics coming out of just a single MySQL instance.

With Grafana you can only choose a single data source per widget, so you can't overlay your data between Prometheus nodes on the same graph. You can of course put them onto the same dashboard in different widgets, but that is limiting. We chose a system where you could store all metrics across all systems at the same resolution and use them for analytics purposes on the same graphs. Having two layers of storage, the local one (robust.. on a single node :) this stretches the definition oddly) and then long-term storage, isn't desirable.

No matter how efficient you make things they won't keep up with the rate of metrics coming out of systems. Limiting to a single box 'because otherwise it's a hard problem' really doesn't seem like a great philosophy. It might be practical for now but longer term that's a very limited vision.


bbrazil 3704 days ago

> There is a limit to a node, 800k metrics per box is not that huge when you consider the things that could be measured just on a single host. We have several thousand metrics coming out of just a single MySQL instance.

To clarify, that's 800k metrics per second per box. That's an upper limit of around 50M metrics with a 60s scrape interval.
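
The arithmetic behind that figure, as a quick sketch. The 800k/s rate and the 60s interval are the numbers from this thread; the 5,000 metrics per MySQL instance is an assumption standing in for "several thousand".

```python
SAMPLES_PER_SECOND = 800_000  # ingestion rate quoted in the thread
SCRAPE_INTERVAL_S = 60        # scrape interval quoted in the thread


def max_active_series(samples_per_second: int, scrape_interval_s: int) -> int:
    """Each series yields one sample per scrape, so the ceiling is rate * interval."""
    return samples_per_second * scrape_interval_s


limit = max_active_series(SAMPLES_PER_SECOND, SCRAPE_INTERVAL_S)
print(limit)  # 48000000, i.e. "around 50M"

METRICS_PER_MYSQL = 5_000  # assumption: stand-in for "several thousand"
print(limit // METRICS_PER_MYSQL)  # 9600 such instances on one box
```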

It'd be quite difficult to hit this limit with a single service inside one datacenter.

If you do manage to hit it, we do have documented ways to horizontally scale beyond that.


dieter_be 3704 days ago

> With Grafana you can only choose a single data source per widget.

This is not correct. Since 2.5 you can mix data sources. See http://docs.grafana.org/guides/whats-new-in-v2-5/
