Hacker News
by user5994461 2160 days ago
Oh right, that's not much, only 3000 metrics per host. I guess we see more scaling problems because we have more metrics and tags (physical hosts with 128 CPUs, imagine the per-CPU metrics just to begin with). I also spent days testing all the collectors and tuning things carefully.
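
To put a rough number on the per-CPU point, here is a back-of-the-envelope sketch. It assumes 8 CPU modes, which is what node_exporter's node_cpu_seconds_total typically reports on Linux; the mode list and the function name are illustrative assumptions, not something measured on the hosts discussed here.

```python
# Rough series-count estimate for ONE per-CPU metric on a single host.
# Assumption: 8 CPU modes, as node_exporter's node_cpu_seconds_total
# typically exposes on Linux; the exact set varies by kernel and version.
CPU_MODES = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def per_cpu_series(num_cpus, modes=CPU_MODES):
    """Each (cpu, mode) label pair is stored as its own time series."""
    return num_cpus * len(modes)


series_128 = per_cpu_series(128)
print(series_128)  # 1024
```

So a single per-CPU metric on a 128-CPU host can already account for about a thousand series before any other collector is enabled.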

That's all very interesting.


Sorry, I just picked one datacenter, so that's 1/5th of the hosts, or 15k metrics per host with some rudimentary math!

Our data lives only in each datacenter, and we query across datacenters via grafana when we need to.

Alright. More data in total but less per server because it's distributed.

We run a pair of servers both storing everything. It's the least that can be done to have any resiliency.

I would love to distribute the data, preferably per continent, but prometheus didn't have a good story on sharding. Running independent datasets is worthless in practice without the ability to aggregate. Also, the more servers, the more expensive (and they're not easy to procure). Running 6 prometheus servers is in the same ballpark as paying for datadog, so might as well just pay for it.

We don't care so much about resiliency; we do back up the prometheus folder using the snapshot API.
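
For anyone who hasn't used it, a minimal sketch of triggering a snapshot over HTTP is below. It assumes prometheus is reachable at localhost:9090 and was started with --web.enable-admin-api (the endpoint is disabled otherwise); the helper names are made up for illustration, and this is not necessarily how the backup described above is scripted.

```python
import json
import urllib.request

# Assumption: prometheus listens here and runs with --web.enable-admin-api.
PROMETHEUS_URL = "http://localhost:9090"


def build_snapshot_request(base_url=PROMETHEUS_URL):
    """Build the POST request for the TSDB snapshot admin endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/admin/tsdb/snapshot", method="POST"
    )


def parse_snapshot_name(body):
    """Extract the snapshot directory name from the JSON response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"snapshot failed: {payload}")
    return payload["data"]["name"]


def take_snapshot(base_url=PROMETHEUS_URL):
    """Trigger a snapshot; it is written under <data-dir>/snapshots/<name>."""
    with urllib.request.urlopen(build_snapshot_request(base_url), timeout=60) as resp:
        return parse_snapshot_name(resp.read())
```

Calling take_snapshot() returns the directory name, and that directory under the data dir's snapshots folder is what gets copied off the host.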

There are a couple of articles about sharding and federation with prometheus, dunno if they existed when you tried it.
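
As a rough illustration of what federation looks like on the wire: a global prometheus scrapes each datacenter's /federate endpoint with one or more match[] selectors. The sketch below only builds that URL; the hostname and selectors are hypothetical examples, not our setup.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL that pulls only the series matching the selectors."""
    # Each selector is passed as a repeated match[] query parameter.
    query = urlencode([("match[]", s) for s in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical per-datacenter server and selectors.
url = federate_url(
    "http://prom-dc1.example:9090",
    ['{job="node"}', '{__name__=~"job:.*"}'],
)
print(url)
```

Federating only pre-aggregated recording rules (the second selector) rather than raw series is the usual way to keep the global server small.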

For us our problems are usually local to a datacenter. Having a dropdown where you can pick the datacenter has proven good enough. It is unlikely that we have a global issue in a service.

Sorry if that was unclear, but we have our own datacenters; our prometheus VMs are essentially free in the grand scheme of things, considering the amount of compute we have.