by xahrepap 98 days ago
Honest question, how do you handle high cardinality data points?

Reference to where my brain is at: https://www.robustperception.io/cardinality-is-key/

I feel like Splunk's business model favors a healthy system and puts an unhealthy one at a major disadvantage. For example: when the system is unhealthy, I know it because all my Splunk queries get queued up while everyone is slamming it with queries. I hate it.

But I'm stuck on how to move some things to Prometheus. Say we have a CustomerID and we want to track the number of times something is done per customer. If we have thousands of customers, cardinality breaks that solution.
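
A rough sketch of why this blows up, in Python. The label names and counts are made up for illustration; the only real point is that the worst-case number of time series for a metric is the product of the distinct values of each label.

```python
from math import prod

def worst_case_series(label_values):
    """Upper bound on time series for one metric.

    label_values maps a label name to its number of distinct values
    (all names and counts here are hypothetical).
    """
    return prod(label_values.values())

without_customer = worst_case_series({"endpoint": 20, "status": 5})
with_customer = worst_case_series(
    {"endpoint": 20, "status": 5, "customer_id": 5000}
)
print(without_customer, with_customer)
```

In practice not every combination shows up, so the real count is usually lower, but a label with thousands of values still multiplies whatever was there before.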

Is there a good solution for this?

2 comments

This gets even worse if your language runs one process per CPU: processes on the same instance can clobber each other's values unless you add fields that uniquely identify them.

We were initially told to just move our telemetry to AWS, but we got a lot of pushback during the migration once they saw how much OTEL amplified data points and cardinality compared to our old StatsD data.

You probably need less cardinality than you think. There is a mix: some stats work fine with less frequent polling, while others, like heap usage, are terrible at 20- or 30-second intervals. Our Pareto frontier was to reduce the sampling rate of most stats and push per-process things like heap usage into histograms.

An aggregator per box can drop a couple of tags before sending metrics upstream, which can help considerably with the number of unique values (e.g., instanceID=[0..31] isn't that useful outside of the box).
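
To make the tag-dropping idea concrete, here is a minimal Python sketch. It assumes counter-style samples shaped as (name, tags, value); `drop_tags`, the metric name, and the tag names are hypothetical and not taken from any particular agent or collector.

```python
from collections import defaultdict

def drop_tags(samples, tags_to_drop):
    """Merge counter samples after removing box-local tags.

    samples: iterable of (metric_name, tags_dict, value).
    Values are summed, which is only appropriate for counters;
    gauges would need a different merge (e.g. max or a histogram).
    """
    totals = defaultdict(float)
    for name, tags, value in samples:
        kept = tuple(sorted(
            (k, v) for k, v in tags.items() if k not in tags_to_drop
        ))
        totals[(name, kept)] += value
    return dict(totals)

# 32 per-process series on one box, distinguished only by instanceID.
samples = [
    ("requests_total", {"host": "box1", "instanceID": str(i)}, 10.0)
    for i in range(32)
]
merged = drop_tags(samples, {"instanceID"})
print(len(samples), "series in,", len(merged), "series out")
```

Each process still reports under its own instanceID locally, so nothing gets clobbered on the box, but upstream only sees one series per host.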

Asking this question got me to stop being lazy and actually try to answer it myself. Mimir is one option that caught my eye:

https://grafana.com/oss/mimir/