| Thank you for your reply and clarification. This is quite an interesting topic for me, as I've tested and implemented similar setups.

> Does image processing, runs our analytics, [...]

Fair enough, I was strictly going by the diagrams.
From my experience with a somewhat similar setup (HA Loki, HA Prometheus + Thanos with a MinIO storage backend, using Terraform + Ansible and Docker), I have to say that the most complex and frustrating part was configuring Loki (this was well before they expanded their documentation, which still isn't great). I'd imagine this would be even more challenging under k8s, at least if you stray from the vanilla deployment and/or charts. I agree with your statement regarding Ceph; we use it extensively in production (probably on a much bigger scale). However, I think Ceph, unlike MinIO, just adds unnecessary complexity to your setup.

> Well yes and no, the number of metrics isn't relevant per se, but its cardinality is very relevant [...]

High cardinality is something you should avoid when using Prometheus, for exactly that reason. There are, in my opinion, very few good reasons for dynamic labels (ignoring the baked-in cardinality from a setup like k8s). On first impulse I'd say you're doing metrics wrong, but then again, I do not know enough about your use case. Maxing out a single instance of Prometheus is no easy feat, however, especially if your infra isn't that complex and/or big.
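To illustrate why dynamic labels hurt, here is a rough back-of-the-envelope sketch. The worst-case number of time series for a single metric is the product of the distinct values of each label. The label names and counts below are made up for illustration and are not taken from your setup.

```python
from math import prod
from typing import Mapping


def series_upper_bound(label_values: Mapping[str, int]) -> int:
    """Worst-case series count for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_values.values())


# Hypothetical label sets; all counts are invented for illustration.
static_labels = {"instance": 50, "job": 1, "status_code": 5, "method": 4}

# Same metric with one dynamic (unbounded) label added.
dynamic_labels = {**static_labels, "user_id": 10_000}

print(series_upper_bound(static_labels))   # 1000
print(series_upper_bound(dynamic_labels))  # 10000000
```

In practice not every combination occurs, so the real number is lower, but a single unbounded label is usually enough to push an otherwise modest metric into the millions of series.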
I've used Thanos for so long now that I have to ask: how does the Cortex compactor handle range queries? Does it also compact and create additional 5m and 1h resolution metrics? These might help with your larger range queries. Just out of curiosity, have you had a look at alternatives like VictoriaMetrics?

> 7k logs per second [...] are nothing

My remark was just regarding the added complexity, as this depends solely on the size of your log messages. If you don't need or use the (awesome!) capabilities of Loki + Grafana and just need a place for long-term storage of your logs, a 'simple' rsyslog server will do just fine.

> we collect prometheus metrics for MySQL, PHP-FPM, Varnish [...]

Many (if not all) of these can be handled by Telegraf or Fluentd plus InfluxDB (not that I've used that myself; I absolutely love Prometheus and its ecosystem). My tongue-in-cheek comment was mostly about the Prometheus instance you deploy on every server just to scrape metrics locally and remote-write them into Cortex. Why not the more usual setup of (one or more) Prometheus instances scraping their targets and writing to Cortex? |