by uaas 679 days ago
You cannot go wrong with the most popular choice: Prometheus/Grafana stack. That includes node_exporter for anything host related, and optionally Loki (and one of its agents) for logs. All this can run anywhere, not just on k8s.
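For anyone who has not looked at what node_exporter actually serves, here is a minimal sketch of reading the plain-text metrics format that Prometheus scrapes. The sample text is made up for illustration, and the parser is deliberately simplified (it assumes no timestamps and no escaped quotes in label values), so treat it as a sketch rather than a full implementation of the format.

```python
import re

# Matches lines like: metric_name{label="value",...} 12.5
# Simplified: ignores timestamps, escaped quotes in label values,
# and the HELP/TYPE metadata lines.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot handle
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Made-up sample; a real node_exporter serves text like this on /metrics.
EXAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.5
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```

In practice you would let Prometheus do the scraping; the point is only that the host metrics are ordinary name/labels/value lines.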

Related, have there been any 'truly open-source' forks of Grafana since their license change? Or does anyone know of good Grafana alternatives from FOSS devs in general? My default right now is to just use Prometheus itself, but I miss some of the dashboard functionality etc. from Grafana.

Grafana's license change to AGPLv3 (I suspect to drive their enterprise sales), combined with an experience I had reporting security vulnerabilities, combined with seeing changes like this[1] not get integrated has left a bad taste in my mouth.

[1] https://github.com/grafana/grafana/pull/6627

AGPLv3 is a completely valid choice for an open source license, and (not that it was necessarily questioned, but since critique of pushing enterprise sales comes up,) having a split open source/enterprise license structure is not particularly egregious and definitely not new. Some people definitely don't like it, but even Richard Stallman is generally approving of this model[1]. It's hard to find someone more ideologically-oriented towards the success and proliferation of free and open source software, though that obviously doesn't mean everyone agrees.

I'm not saying, FWIW, that I think AGPL is "good", but it is at least a perfectly valid open source license. I'm well aware of the criticisms of it in general. But if you're going to relicense an open source project to "defend" it against abuse, AGPL is probably the most difficult to find any objection to. It literally exists for that reason.

I don't necessarily think that Grafana is the greatest company ever or anything, but I think these gripes are relatively minor in the grand scheme of things. (Well, the security issue might be a bit more serious, but without context I can't judge that one.)

[1]: https://www.fsf.org/blogs/rms/selling-exceptions

To be fair, AGPLv3 is a very valid open source licence.

Now, poor and bad behaviour from the prom maintainers is a very fertile subject. If you want to see some real spicy threads, check out the one where people raised that Prom’s calculation of rate is incorrect, or the thread where people asked for prom to interpolate secrets into its config from env vars - like every other bit of common cloud-adjacent software.

Both times prom devs behaved pretty poorly and left a really bad taste in my mouth. Victoria Metrics seems like a much better replacement.
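For context on the env vars request: a common workaround is to render the config file from a template before the server starts. Below is a minimal sketch of that idea using only the Python standard library. The job name, username and the SCRAPE_PASSWORD variable are hypothetical names chosen for illustration, and this is a pre-processing step you run yourself, not something Prometheus does.

```python
import os
from string import Template

# Hypothetical config template; SCRAPE_PASSWORD is an illustrative name.
CONFIG_TEMPLATE = """\
scrape_configs:
  - job_name: example
    basic_auth:
      username: metrics
      password: ${SCRAPE_PASSWORD}
    static_configs:
      - targets: ["localhost:9100"]
"""


def render_config(template_text, env):
    """Substitute ${NAME} placeholders; raises KeyError if one is missing."""
    return Template(template_text).substitute(env)


env = dict(os.environ)
env.setdefault("SCRAPE_PASSWORD", "changeme")  # demo fallback only
print(render_config(CONFIG_TEMPLATE, env))
```

Using substitute() rather than safe_substitute() is a deliberate choice here: a missing variable fails loudly instead of leaving a literal placeholder in the rendered config.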

The AGPL is probably the best option for a FOSS license. Why do you consider it "not truly open-source"?
AGPL prevents wide product adoption, since corporate lawyers caution against relying on AGPL products: it is easy to violate the license terms and get sued as a result.
It's not possible to sell non-FOSS modifications to AGPL-licensed software. I think that's intended. It's not antithetical to Open Source, quite the opposite in fact.
Good. It doesn't prevent you from using it (as opposed to selling it) in your company.
Yeah, but lawyers (and companies where these lawyers work) are afraid of licenses with unclear or vague terms such as GPL, LGPL, AGPL, BSL, etc. They prefer to deal with software licensed under clear and concise open-source licenses such as Apache2, MIT and BSD.
If lawyers micro-manage what you use for your internal tooling you have lost. You can't work anymore.

If lawyers are afraid of licenses they should change their profession.

Do those companies really care about open-source, or just about code they can freely integrate into their proprietary products?
> combined with them not being a good steward for changes like this[1] left a bad taste in my mouth.

What did they do wrong with this PR? It seems eventually they realized the scope was much bigger, requiring changes on both the frontend and backend, and asked potential contributors to reach out if they're interested in contributing that particular feature (saying between the lines that they themselves don't have a use, but they won't reject a PR).

Seems like they didn't need it themselves, and asked the community to contribute it if someone really wanted it, but no one has stepped up since then.

I think the word to use is rather copyleft! AGPL is fully open source in its truest sense! It’s so open that it ensures it always stays open!
The FOSS alternative to Grafana is Grafana, which is FOSS. More FOSS than it was before, actually.
Can you explain the problem with that GitHub pull request? I did not get it.
I'm using VictoriaMetrics instead of Prometheus, am I doing something wrong? I have Zabbix as well as node_exporter and Percona PMM for MySQL servers, because sometimes it is hard to configure the Prometheus stack for SNMP when Zabbix covers this case out of the box.
Prometheus itself is pretty simple, fairly robust, but doesn’t necessarily scale for long-term storage as well. Things like VictoriaMetrics, Mimir, and Thanos tend to be a bit more scalable for longer term storage of metrics.

For a few hundred gigs of metrics, I’ve been fine with Prometheus and some ZFS-send backups.

Just to expand upon some experiences with some of the listed software.

The architecture is quite different between Thanos and the others you've listed: unlike the others, Thanos queries fan out to remote Prometheus instances for hot data, while a sidecar ships data (typically older than 2 hours) out to S3 storage. As the routing of a query depends on the Prometheus external labels, our developer queries would often fan out unnecessarily to multiple Prometheus instances. This is because our developers often search for metrics via a service name or some service-related label, rather than via the external label that describes the location of the workload, which is what Thanos uses for routing.
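To make the fan-out problem concrete, here is a toy model of the routing decision. It is not Thanos code; the store names, the "cluster" external label and the matcher format are all made up for illustration. The only idea it captures is that a store can be skipped when the query names an external label with a different value.

```python
def stores_to_query(stores, matchers):
    """Pick the stores whose external labels do not contradict the matchers.

    stores:   {store_name: {external_label: value}}
    matchers: {label: value} equality matchers taken from the query
    """
    selected = []
    for name, external_labels in stores.items():
        conflict = any(
            label in external_labels and external_labels[label] != value
            for label, value in matchers.items()
        )
        if not conflict:
            selected.append(name)
    return selected


# Hypothetical deployment: one Prometheus per cluster.
STORES = {
    "prom-eu": {"cluster": "eu-1"},
    "prom-us": {"cluster": "us-1"},
    "prom-ap": {"cluster": "ap-1"},
}

# A query by service name only cannot rule out any store.
print(stores_to_query(STORES, {"service": "checkout"}))
# → ['prom-eu', 'prom-us', 'prom-ap']

# Adding the external label narrows it to a single store.
print(stores_to_query(STORES, {"service": "checkout", "cluster": "eu-1"}))
# → ['prom-eu']
```

In the first case the query is only as fast as the slowest of the three instances, which is the behaviour described above.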

Upon identifying this, I migrated to Mimir and we saw immediate drops in query response times for developer queries, which now don't have to wait for the slowest Prometheus instances before displaying the data.

We've also since adopted OpenTelemetry in our workloads and directly ingest OTLP into Mimir (which VictoriaMetrics also supports).

I wrote an extensive reply to this but unfortunately the HN servers restarted and lost it.

The TL;DR was that from where I stand, you’re doing nothing wrong.

At a previous client we ran Prometheus for months, then Thanos, and eventually we implemented Victoria Metrics and everyone was happy. It became an order of magnitude cheaper due to using spinning rust for storage and still getting better performance. It was infinitely and very easily scalable, mostly automatically.

The “non-compliant” bits of the query language turned out to be fixes to the UX and other issues. Lots of new functions and features.

Support was always excellent.

I’m not affiliated with them in any way. Was always just a very happy freeloading user.

I have deployed lots of metrics systems, starting with cacti and moving through graphite, kairosdb (which used Cassandra under the hood), Prometheus, Thanos and now Mimir.

What I've realised is that they're all painful to scale 'really big'. One Prometheus server is easy. And you can scale vertically and go pretty big. But you need to think about redundancy, and you want to avoid ending up accidentally running 50 Prometheus instances, because that becomes a pain for the Grafana people. Unless you use an aggregating proxy like Promxy. But even then you have issues running aggregating functions across all of the instances. You need to think about expiring old data and possibly aggregating it down into a smaller set so you can still look at certain charts over long periods. What's the Prometheus solution here? MOAR INSTANCES. And reads need to be performant or you'll have very angry engineers during the next SEV1, because their dashboards aren't loading. So you throw in an additional caching solution like Trickster (which rocks!) between Grafana and the metrics. Back in the Kairosdb days you had to know a fair bit about running Cassandra clusters, but these days it's all baked into Mimir.
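As a rough illustration of "aggregating it down into a smaller set", here is a minimal downsampling sketch that averages raw samples into fixed-width time buckets. The function name, the bucket width and the sample data are made up for illustration. Real systems typically keep several aggregates per bucket (min, max, sum, count) rather than just a mean.

```python
from collections import defaultdict


def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) sorted by bucket_start.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Six raw samples at 15s resolution collapse into two 60s points.
raw = [(0, 1.0), (15, 3.0), (30, 5.0), (45, 7.0), (60, 2.0), (75, 4.0)]
print(downsample(raw, 60))
# → [(0, 4.0), (60, 3.0)]
```

Averaging like this only makes sense for gauge-style values; counters would need a different treatment, such as keeping the last value per bucket.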

I'm lucky enough to be working for a smaller company right now, so I don't have to spend a lot of time tending to the monitoring systems. I love that Mimir is basically Prometheus backed by S3, with all of the scalability and redundancy features built in (though you still have to configure them in large deployments). As long as you're small enough to run their monolithic mode you don't have to worry about individually scaling half a dozen separate components. The actual challenge is getting the $CLOUD side of it deployed, and then passing roles and buckets to the nasty helm charts while still making it easy to configure the ~10 things that you actually care about. Oh and the helm charts and underlying configs are still not rock solid yet, so upgrades can be hairy.

Ditto all of that for logging via Loki.

It's very possible that Mimir is no better than Victoria Metrics, but unless it burns me really badly I think I'll stick with it for now.

Not doing anything wrong. It scales better and has better performance. Works well. Prometheus is also fine.
Well, they claim superior performance (which might be true), but the costs are high and include a small community, low quality APIs, best effort correctness/PromQL compatibility, and FUD marketing, so I decided to go with the de-facto standard without all of the issues above.
No costs if you're hosting everything. It does scale better and has better performance. Used it and have nothing bad to say about it. For the most part a drop-in replacement that just performs better. Didn't run into PromQL compatibility issues with off-the-shelf Grafana dashboards.
Could you provide more details regarding low quality APIs and PromQL compatibility issues? The following article explains "issues" with PromQL compatibility in VictoriaMetrics - https://medium.com/@romanhavronenko/victoriametrics-promql-c... . See also https://docs.victoriametrics.com/metricsql/ . TL;DR: MetricsQL fixes PromQL issues with rate() and increase() functions. That's why it is "incompatible" with PromQL.

Could you provide examples of FUD marketing from VictoriaMetrics?
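For readers who have not followed the rate()/increase() debate, here is a simplified sketch of the two approaches being compared. Neither function is the actual code of either project. The first approximates extrapolation-style behaviour, using only samples inside the window and scaling the result towards the window edges. The second follows the approach described in the linked MetricsQL docs as I understand them, using the last sample before the window and not extrapolating. Counter resets and other edge cases are left out on purpose.

```python
def increase_extrapolated(samples, start, end):
    """Simplified extrapolation-style increase over the window (start, end].

    samples: sorted list of (timestamp, counter_value), no counter resets.
    """
    inside = [(t, v) for t, v in samples if start < t <= end]
    if len(inside) < 2:
        return None  # not enough samples inside the window
    (first_t, first_v), (last_t, last_v) = inside[0], inside[-1]
    raw_increase = last_v - first_v
    sampled = last_t - first_t
    avg_interval = sampled / (len(inside) - 1)
    threshold = avg_interval * 1.1
    covered = sampled
    for gap in (first_t - start, end - last_t):
        # Extend to the window edge if it is close, else by half an interval.
        covered += gap if gap < threshold else avg_interval / 2
    return raw_increase * covered / sampled


def increase_with_previous_sample(samples, start, end):
    """Increase from the last sample at or before `start` to the last one in the window."""
    before = [v for t, v in samples if t <= start]
    inside = [v for t, v in samples if start < t <= end]
    if not before or not inside:
        return None
    return inside[-1] - before[-1]


# A counter scraped every 15s that goes from 10 to 11 exactly once.
samples = [(0, 10), (15, 10), (30, 10), (45, 11), (60, 11)]
print(increase_extrapolated(samples, 0, 60))          # about 1.33
print(increase_with_previous_sample(samples, 0, 60))  # 1
```

The non-integer result for an integer counter is the kind of behaviour people complain about; whether it counts as "incorrect" or as a reasonable estimate is exactly what the threads argue over.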

I am on mobile, so cannot really link GitHub for examples, but I'd recommend anyone considering using VM over Prometheus to take a cursory look into how similar things are implemented in both projects, and what shortcuts were made in the name of getting "better performance".

Performance-wise e.g. VictoriaMetrics' prometheus-benchmark only covered instant queries without look back for example the last time I checked.

Regarding FUD marketing: All Prometheus community channels (mailing lists, StackOverflow, Reddit, GitHub, etc.) are full of VM devs pushing VM, bashing everything from the ecosystem without mentioning any of the tradeoffs. I am also not aware of VictoriaMetrics giving back anything to the Prometheus ecosystem (can you maybe link some examples if I am wrong?) which is very similar to Microsoft's embrace, extend, and extinguish strategy. As for recent actual examples, here are 2 submissions of the same post bashing a project in the ecosystem: https://news.ycombinator.com/item?id=40838531, https://news.ycombinator.com/item?id=39391208, but it's really hard to avoid all the rest in the places mentioned above.

> Performance-wise e.g. VictoriaMetrics' prometheus-benchmark only covered instant queries without look back for example the last time I checked.

prometheus-benchmark ( https://github.com/VictoriaMetrics/prometheus-benchmark ) tests CPU usage, RAM usage and disk usage for typical alerting queries. It doesn't test the performance of queries used for building graphs in Grafana because the typical rate of alerting queries is multiple orders of magnitude higher than the typical rate of queries for building graphs, i.e. alerting queries generate most of the load on CPU, RAM and disk IO in a typical production workload.
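For anyone unsure about the distinction being argued here: alerting rules are evaluated as instant queries, while Grafana graphs use range queries. A minimal sketch of the two request shapes against the Prometheus HTTP API follows. The base URL and the example expression are assumptions for illustration, and the code only builds URLs without sending anything.

```python
from urllib.parse import urlencode

# Assumption: a Prometheus-compatible server listening locally.
BASE_URL = "http://localhost:9090"


def instant_query_url(expr, at):
    """URL for a single evaluation of `expr` at Unix time `at`."""
    return f"{BASE_URL}/api/v1/query?" + urlencode({"query": expr, "time": at})


def range_query_url(expr, start, end, step):
    """URL for evaluating `expr` every `step` seconds between start and end."""
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{BASE_URL}/api/v1/query_range?" + urlencode(params)


expr = "rate(node_cpu_seconds_total[5m])"
now = 1700000000  # arbitrary example timestamp
print(instant_query_url(expr, now))
print(range_query_url(expr, now - 3600, now, 15))
```

A one-hour range query at a 15s step asks for 241 evaluation points, so a single graph panel costs far more per request than a single alert evaluation, even if alert evaluations happen much more often.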

Please file a feature request at https://github.com/VictoriaMetrics/prometheus-benchmark/issu... to add the ability to test resource usage for typical queries for building Grafana graphs if you think this will be a good feature.

> I am also not aware of VictoriaMetrics giving back anything to the Prometheus ecosystem (can you maybe link some examples if I am wrong?)

Sure:

- https://github.com/prometheus/prometheus/issues?q=author%3Av...

- https://github.com/prometheus/prometheus/issues?q=author%3Al...

- https://github.com/prometheus/prometheus/issues?q=author%3Ah...

> As for recent actual examples, here are 2 submissions of the same post bashing a project in the ecosystem: https://news.ycombinator.com/item?id=40838531

This submission links to the real-world experience of a long-term user of Grafana Loki. This user points to various issues in the applications he uses. For example:

- Issues with Loki restarts - https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLok...

- Issues with structured metadata in Loki 3.0 - https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLok...

- Issues with single-node Loki setup - https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLok...

- Issues with Loki logcli command - https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLok...

- Issues with Grafana Loki data compaction - https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLok...

- Comparison of Grafana Loki vs traditional syslog server - https://utcc.utoronto.ca/~cks/space/blog/sysadmin/GrafanaLok...

As you can see, this user shares his extensive experience with Grafana Loki, and continues using it despite the fact that a much better solution exists, which is free from all the Loki issues - VictoriaLogs. This user isn't affiliated with VictoriaMetrics in any way.

That's how to monitor, not what to monitor.
Yeah, I've been working on deploying such a stack with added txtai indexing so I can just ask my stack questions - set up txtai workflows and be able to slice questions across what you're monitoring.