by Millenis
3701 days ago
The point on ingress vs egress is that most systems already have a route out. Creating a route in takes much more effort, especially when you have NATs etc. It's very nice to be able to spin up nodes and not have to care about opening firewall ports in security groups. By default AWS limits inbound and has no limitations on outbound (in EC2 Classic; things can be changed in VPC). Managing that security list centrally is far more auditable. I'm very surprised you've encountered anyone who thinks it's a great idea to open a port to every security group from a certain location (or many).

Like it or not, you are beginning to split Prometheus out into a set of services, regardless of the underlying belief that it all needs to fit onto a single node. That will only become more obvious over time as devices and metrics increase (which they very obviously will with containers and application metrics). There is a limit to a node, and 800k metrics per box is not that huge when you consider the things that could be measured on just a single host. We have several thousand metrics coming out of just a single MySQL instance.

With Grafana you can only choose a single data source per widget, so you can't overlay data from different Prometheus nodes on the same graph. You can of course put them onto the same dashboard in different widgets, but that is limiting. We chose a system where you could store all metrics across all systems at the same resolution and use them for analytics purposes on the same graphs.

Having two layers of storage (robust... on a single node :) which stretches the definition oddly) plus long-term storage isn't desirable. No matter how efficient you make things, they won't keep up with the rate of metrics coming out of systems. Limiting to a single box 'because otherwise it's a hard problem' really doesn't seem like a great philosophy. It might be practical for now, but longer term that's a very limited vision.
|
To clarify, that's 800k metrics per second per box. That's an upper limit of around 50M metrics with a 60s scrape interval.
It'd be quite difficult to hit this limit with a single service inside one datacenter.
If you do manage to hit it, we do have documented ways to horizontally scale beyond that.
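
For anyone who wants to check the arithmetic, here is a rough Python sketch of the numbers above. It assumes every metric is sampled exactly once per scrape interval; the function name and the figure of 5,000 metrics per MySQL instance are made up for illustration (the thread only says "several thousand").

```python
def max_metrics(samples_per_second: int, scrape_interval_s: int) -> int:
    """Upper bound on distinct metrics one box can hold, assuming each
    metric contributes exactly one sample per scrape interval."""
    return samples_per_second * scrape_interval_s


INGEST_RATE = 800_000      # samples per second per box, from the thread
SCRAPE_INTERVAL_S = 60     # seconds, from the thread

capacity = max_metrics(INGEST_RATE, SCRAPE_INTERVAL_S)
print(capacity)  # 48000000, i.e. "around 50M"

# Hypothetical: "several thousand" metrics per MySQL instance taken as 5,000.
MYSQL_METRICS_PER_INSTANCE = 5_000
instances_per_box = capacity // MYSQL_METRICS_PER_INSTANCE
print(instances_per_box)  # 9600
```

Under those assumptions a single box tops out at 48M metrics, or roughly 9,600 MySQL instances at 5,000 metrics each. A shorter scrape interval shrinks that ceiling proportionally.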