by baby_souffle 711 days ago

> you can detect stalled metrics (per host or service), who didn't send the data on time, etc

I guess the difference here is that we leverage service discovery in Prometheus for this instead of having to externally build an authoritative list of who/what should have pushed metrics.

> <...> and wait for a response.

As opposed to waiting for $thing to push metrics to you?

I guess I'm not convinced that one architecture is obviously better. There might be some downsides to a particular implementation, but generally they both work, and only external constraints will dictate which you use. E.g., if you're required to ship metrics to multiple places, pushing to both Graphite and Datadog becomes easier.

Anything that _should_ be scraped is tagged a certain way and anything that doesn't respond to a scrape gets flagged. After a few flags, an operator is paged. When $thing is destroyed or re-provisioned, different tags lead to a different set of $things to scrape metrics from.
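
To make that flow concrete, here is a rough plain-Python sketch of the bookkeeping. It is not Prometheus code or configuration; the `scrape` tag, the threshold of 3, and the names `select_targets` and `ScrapeMonitor` are all made up for illustration.

```python
from dataclasses import dataclass, field

FLAG_THRESHOLD = 3  # hypothetical: consecutive failed scrapes before paging


def select_targets(inventory, tag="scrape"):
    """Return the names of everything carrying the (assumed) scrape tag."""
    return {name for name, tags in inventory.items() if tag in tags}


@dataclass
class ScrapeMonitor:
    threshold: int = FLAG_THRESHOLD
    flags: dict = field(default_factory=dict)  # target -> consecutive failures

    def sync_targets(self, discovered):
        """Adopt the discovered target set; destroyed targets simply drop out."""
        self.flags = {t: self.flags.get(t, 0) for t in discovered}

    def record(self, target, ok):
        """Record one scrape result; return True if an operator should be paged."""
        if target not in self.flags:
            return False  # not something we are supposed to scrape
        self.flags[target] = 0 if ok else self.flags[target] + 1
        return self.flags[target] >= self.threshold


# Usage: discovery decides who is expected, failed scrapes accumulate flags.
inventory = {"web-1": {"scrape"}, "db-1": {"scrape"}, "tmp-1": set()}
monitor = ScrapeMonitor()
monitor.sync_targets(select_targets(inventory))
for _ in range(FLAG_THRESHOLD):
    should_page = monitor.record("web-1", ok=False)
```

The point is that the list of expected targets falls out of the tags, so nobody has to maintain it by hand: re-running `sync_targets` after a re-provision is all it takes to stop watching something that no longer exists.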