|
|
|
|
|
by EdwardDiego
2144 days ago
|
|
We've used Thanos to aggregate multiple Prometheus (Promethii?) across our clusters to enable us to scale, each Prometheus deals only with a subset of scrape targets. Biggest issue I've had was an app that was accidentally publishing several thousand metrics which caused the default scrape timeout of 15s to kick in. (It was publishing Kafka lag per consumer group per topic, which was fine and dandy, until someone released an app that runs about 500 instances at peak, and scaled up and down frequently, and had incorporated the pod id into the consumer group names, which led to Kafka tracking many, many, many consumer groups. Given that the consumers were low value anyway, we now just exclude them from having their lag tracked.) |
|
Prometheuses.
ii is for latin words. Prometheus is/was Greek. I guess you could use Prometheoí but it would quickly derail any conversation. :)