I've commented fairly heavily in the related Grafana thread.
Prometheus is a bit of a different story. It does have some operational overhead when you get to a certain point, and scaling it out is not always trivial.
Assuming it works, there is value-add on this one, and the pricing is more in line with active use (i.e., a cost-plus model, which is more typical of AWS services).
This seems the more interesting of the two. Grafana is pretty simple to set up and maintain; the harder part is handling the metrics themselves, be it with InfluxDB, Prometheus, or something else.
It's a completely different problem: by default Prometheus does not shard anything, so you're bound to a single instance, whereas ES and Kafka are cluster-based.
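To make the sharding point concrete, here is a rough Python sketch of what splitting scrape targets across several independent instances by hand might look like. The hostnames, function names, and shard count are made up for illustration, and this is not Prometheus's own code, only the general hash-and-modulo idea.

```python
import hashlib


def shard_for(target: str, num_shards: int) -> int:
    """Map a scrape target to a shard index using a stable hash."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Use the last 8 bytes of the digest as an unsigned integer, then take the modulus.
    return int.from_bytes(digest[-8:], "big") % num_shards


def partition(targets, num_shards):
    """Group targets by the shard (instance) that would scrape them."""
    shards = {i: [] for i in range(num_shards)}
    for target in targets:
        shards[shard_for(target, num_shards)].append(target)
    return shards


# Hypothetical targets; each one lands on exactly one of three instances.
targets = [f"node-{i}.example.internal:9100" for i in range(12)]
shards = partition(targets, 3)
for index, assigned in shards.items():
    print(index, assigned)
```

Each instance then only sees its own slice of the data, so anything that queries across all targets needs an extra layer on top, which is where the operational overhead comes from.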
Out of interest, what do you find hard about running Elasticsearch clusters?
In my experience ES has been one of the easiest clustered / highly available and sharded systems I've ever run - especially for how incredibly performant and reliable it is.
I've generally found that beyond right-sizing your nodes, indexes, and shard configuration, it pretty much just works without ever really having issues.
It's not a drop-in replacement (even though it tries to sell itself as such). It's incompatible in a significant number of ways and throws away part of your data.
We have been using VictoriaMetrics in production for more than 6 months. It is very reliable and scalable, and it handles more than 2B series in our setup without breaking a sweat.