Hacker News new | ask | show | jobs
by netingle 1543 days ago
(Tom here; I started the Cortex project on which Mimir is based and lead the team behind Mimir)

Thanos is an awesome piece of software, and the Thanos team have done a great job building an vibrant community. I'm a big fan - so much so we used Thanos' storage in Cortex.

Mimir builds on this and makes it even more scalable and performance (with a sharded compactor and query engine). Mimir is multitenant from day 1, whereas this is a relatively new thing in Thanos I believe. Mimir has a slightly different deployment model to Thanos, but honestly even this is converging.

Generally: choosing Thanos is always going to be a good choice, but IMO choosing Mimir is an even better one :-p

2 comments

Okay, but why? I am using Thanos today. It works, it's complex, when it breaks, it's a bit of a challenge to fix, but it happens. It doesn't break often.

It does the job. Mimir, which is based on Cortex, using either Mimir, or Cortex, what benefit am I getting?

I get asked every few months about moving off of Thanos to Cortex, and today now Mimir, and I don't have any substantial reason to do so. It feels like moving for the sake of moving.

I need to see some real reasoning as to why I am going to add value to move everything to Mimir.

Sounds like Thanos is working well for you, so in your position I wouldn't change anything.

There are a bunch of other reasons why people might choose Mimir; perhaps they have out grown some of the scalability limits, or perhaps they want faster high cardinality queries, or a different take on multi-tenancy.

Do remember Cortex (on which Mimir is based) predates Thanos as a project; Thanos was started to pursue a different architecture and storage concept. Thanos storage was clearly the way forward, so we adopted it. The architectures are still different: Thanos is "edge"-style IMO, Mimir is more centralised. Some people have a preference for one over the other.

That's fair, thanks for the input. The only reason we implemented Thanos in the first place was a particular feature that we needed at the time of implementation. Now using it in an extremely large environment, I haven't seen any scalability limits. Speed of queries isn't a driver of anything.

Multi Tenancy certainly is, but we have our own custom multi tenancy solution over top of it we built ourselves. I'd like to get rid of that ultimately, but we're not utilizing whatever multi tenant features exist at the moment. Perhaps that will be a driver.

Appreciate your thoughts.

We were struggling with Cortex a couple years ago, then we tried VictoriaMetrics and haven't look back. It goes pretty much unattended with just monitoring disk space to make sure we still have room to continue pouring in metrics. When a component crashes (not often) it recovers pretty much without noticing.
Multi-tenancy is something that shouldn't be underestimated. A lot of people think it's just a checklist item until (a) they need it or (b) they try to implement it in an existing system. Kudos for making it a day-one feature.
While I agree with your point in the general case, would you mind elaborating on the specific case of Prometheus?

My understanding is that the recommended best-practice for Prometheus is to deploy as many of them as necessary, as close to the monitored infrastructure as possible.

What use case would require deploying a single Mimir, so supposedly Prometheus (cluster) in the case of serving multiple tenants? Why not just deploy a dedicated Prometheus / Mimir stack per client?

I don't know Prometheus, but I would imagine the answer depends on just how many clients you have. Probably doesn't matter if you're talking just a few. If it's a lot, then separate instances can be very expensive in terms of operational complexity and waste due to resource fragmentation. Multi-tenancy is good for bringing both of those back under control. Is there something about Prometheus that would negate that?
For one, it doesn't really support authentication (although it's on the roadmap).

I'm no Prometheus expert, but since you're pretty much expected to be running a bunch of servers anyway, the operational complexity has to be handled even for just one client.

You do have a point on resource fragmentation, but IME Prometheus' resource usage is fairly predictable, so you could probably mitigate that to a point.