by SuperQue 1545 days ago
One interesting question I have is in regard to global availability.

With our current Thanos deployment, we can tie a single geo regional deployment together with a tiered query engine.

Basically like this:

"Global Query Layer" -> "Zone Cluster Query Layer" -> "Prom Sidecar / Thanos Store"

We can duplicate the "Global Query Layer" in multiple geo regions, each with its own replicated Grafana instances. If a single region/zone has trouble, we can still access metrics in other regions/zones. This avoids Thanos having any SPoFs for large multi-user (Dev/SRE) orgs.
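To make the tiering concrete, here is a rough toy model of the fan-out in Python. It is only a sketch: the `Store` and `Querier` classes, the node names, and the "skip a failed branch" behaviour are illustrative assumptions, not Thanos code or its actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Leaf node: stands in for a Prometheus sidecar or Thanos Store (hypothetical)."""
    name: str
    series: dict[str, float]
    healthy: bool = True

    def query(self, metric: str) -> list[tuple[str, float]]:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        return [(self.name, self.series[metric])] if metric in self.series else []


@dataclass
class Querier:
    """Fans a query out to its children, which may be stores or other queriers."""
    name: str
    children: list = field(default_factory=list)

    def query(self, metric: str) -> list[tuple[str, float]]:
        results = []
        for child in self.children:
            try:
                results.extend(child.query(metric))
            except ConnectionError:
                # Assumption: a failed branch is skipped and a partial result is returned.
                continue
        return results


# Two zone-level queriers; the only store in zone-b is down.
zone_a = Querier("zone-a", [Store("prom-a1", {"up": 1.0}), Store("prom-a2", {"up": 1.0})])
zone_b = Querier("zone-b", [Store("prom-b1", {"up": 1.0}, healthy=False)])

# The global layer is duplicated: both copies point at the same zone queriers.
global_eu = Querier("global-eu", [zone_a, zone_b])
global_us = Querier("global-us", [zone_a, zone_b])

print(global_eu.query("up"))
print(global_us.query("up"))
```

Either global querier still answers with the zone-a series while zone-b is down, which is the property described above: no single region or zone takes out the whole query path.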

2 comments

This is one of my favorite things about Thanos. We run Prometheus in multiple private datacenters, multiple AWS regions across multiple AWS accounts, and multiple Azure regions across multiple subscriptions. We have three global labels: cloud, region, and environment. With Thanos's Store/Querier architecture we have a single Datasource in Grafana where we can quickly query any metric from any environment across the breadth of our infrastructure.
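As a rough illustration of why the three global labels matter, here is a small Python sketch of label-based selection across everything one datasource can see. The `select` helper, the label values, and the sample series are made up for the example; in practice this would be an ordinary PromQL selector on `cloud`, `region`, and `environment`.

```python
def select(series, **matchers):
    """Return the series whose labels equal every given label=value pair."""
    return [
        s for s in series
        if all(s["labels"].get(name) == value for name, value in matchers.items())
    ]


# Hypothetical series as a global querier might see them, each tagged with
# the three global labels: cloud, region, environment.
series = [
    {"labels": {"__name__": "up", "cloud": "aws", "region": "us-east-1", "environment": "prod"}, "value": 1.0},
    {"labels": {"__name__": "up", "cloud": "aws", "region": "eu-west-1", "environment": "staging"}, "value": 1.0},
    {"labels": {"__name__": "up", "cloud": "azure", "region": "westeurope", "environment": "prod"}, "value": 0.0},
    {"labels": {"__name__": "up", "cloud": "private", "region": "dc1", "environment": "prod"}, "value": 1.0},
]

# One query surface, narrowed by labels rather than by picking a datasource.
prod_everywhere = select(series, environment="prod")
aws_prod = select(series, cloud="aws", environment="prod")

print(len(prod_everywhere), len(aws_prod))
```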

It's really a shame that Loki in particular doesn't share this kind of architecture. Seems like Mimir, frustratingly, will share this deficiency.

The typical way to run Mimir is centralised, with different regions/datacenters feeding metrics into one place. You can run that central system across multiple AZs.

If you run Mimir with an object store (e.g. S3) that supports replication then you can have copies in multiple geographies and query them, but the copies will not have the most recent data.
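A toy way to picture the staleness, in Python. This is a sketch under loud assumptions: the upload times, the fixed replication lag, and the `visible_blocks` helper are invented for illustration and do not describe how Mimir or any particular object store schedules replication.

```python
def visible_blocks(upload_times, now, replication_lag):
    """Blocks a bucket copy can serve: only those uploaded at least `replication_lag` seconds ago."""
    return [t for t in upload_times if t <= now - replication_lag]


# Hypothetical block upload times, in seconds since some starting point.
uploads = [0, 7200, 14400, 21600]
now = 22000

primary = visible_blocks(uploads, now, replication_lag=0)    # the bucket written to directly
replica = visible_blocks(uploads, now, replication_lag=900)  # a copy trailing by 15 minutes (assumed)

# Whatever the primary has that the replica does not is the "most recent data" gap.
missing_from_replica = [t for t in primary if t not in replica]
print(missing_from_replica)
```

In this model the replica can answer queries, but only up to the last block that finished replicating.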

(Note: I work on Mimir.)