by varikin 2777 days ago
I used to work at a Fortune 50 retailer on the cloud platform (a lot of tooling around CI/CD for the teams that manage the website). We had a large problem with keeping the metrics pipeline current. A major issue is that by default, Spring Boot publishes about 500 different metrics on a 10-second slice. Allowing every application to pump out that many default metrics, most of which are never used, means that it takes only 334 instances to get one million metrics per minute (1,000,000 / 6 / 500 = 333 1/3). I would guess we had a couple thousand instances in production on a normal day. In a couple of weeks, they have Black Friday. Any team that hasn't been able to fix its performance problems is given the go-ahead to just throw money at it and scale horizontally in obscene ways.
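
To make that arithmetic concrete, here is a minimal Python sketch. The 500 metrics and the 10-second interval are the rough figures from above; the function names are made up for illustration, and the 2,000-instance fleet in the usage note is only my "couple thousand" guess, not a measured number.

```python
import math


def metrics_per_minute(metrics_per_publish: int, publish_interval_s: int) -> float:
    """Data points a single instance emits per minute."""
    publishes_per_minute = 60 / publish_interval_s
    return metrics_per_publish * publishes_per_minute


def instances_to_reach(target_per_minute: int,
                       metrics_per_publish: int,
                       publish_interval_s: int) -> int:
    """Smallest whole number of instances whose combined output meets the target."""
    per_instance = metrics_per_minute(metrics_per_publish, publish_interval_s)
    return math.ceil(target_per_minute / per_instance)


if __name__ == "__main__":
    # ~500 default metrics published every 10 seconds (approximate figures).
    print(metrics_per_minute(500, 10))
    print(instances_to_reach(1_000_000, 500, 10))
```

With those inputs, one instance emits 3,000 data points per minute, so 334 instances cross the million mark. A fleet of 2,000 instances (an assumed size) would be around 6 million per minute before any Black Friday scaling.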

Of course, our metrics were all handled in house. From talking to the teams that handled the metrics pipeline, the vendors were great for smaller companies, but there was no off-the-shelf solution for a company that large with that volume. But I did very little with that myself, other than look into the fact that Spring Boot published way too many default metrics. Who needs P50, P70, P75, P80, P85, P90 - P99 on all web requests?! Just set a default that is small and worthwhile and let the developers adjust as needed.

2 comments

Why this whole per-minute thing? I can insert 1,000,000 per minute on my MBP using reasonable batching (it's only about 16.7K per second).
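
For reference, the per-second conversion can be sketched in a few lines of Python. The batch size of 1,000 rows is a made-up value for illustration, not something stated in this thread, and the function names are hypothetical.

```python
import math


def per_second(rate_per_minute: int) -> float:
    """Convert a per-minute rate to a per-second rate."""
    return rate_per_minute / 60


def batches_per_second(rate_per_minute: int, batch_size: int) -> int:
    """Whole batches per second needed to keep up with the given rate."""
    return math.ceil(per_second(rate_per_minute) / batch_size)


if __name__ == "__main__":
    print(round(per_second(1_000_000)))
    # batch_size=1000 is an assumed value, chosen only for illustration.
    print(batches_per_second(1_000_000, 1000))
```

One million per minute works out to roughly 16,667 per second, or about 17 batched writes per second at the assumed batch size.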
Folks need to resist the inclination to just gather maximum data for the hell of it.

If you're pumping out a million metrics per minute, almost none of those are ever going to actually be used to generate meaningful insight.

I used to work at a startup that made physical robots. The robot generated several GBs of data every time it turned on. You're correct, most of that data wasn't looked at most of the time. But every now and then, someone would say "Hey, I saw a robot do something funny the other day, what the hell happened?" And having all that data usually made it possible to figure out what happened. To me, "maximum data for the hell of it" isn't about generating insight by looking at trends, it's about generating insight during incident analysis.
> To me, "maximum data for the hell of it" isn't about generating insight by looking at trends, it's about generating insight during incident analysis.

Agree 100%.

That is a very particular use case, where very high res data is critical. I note that even here, you're interested in data from "the other day", not years ago.

In most cases, time spent maintaining terabytes of rapidly aging time series data would be better spent elsewhere.

I think that really depends on the case.

A particularly good high-frequency trader might be interested in terabytes of minutiae when they're trying to sort out what caused yesterday's spike and crash of ticker XYZ.

Systems and sales analysts that are looking at web store front ends (and back ends, if there are issues) would be interested in large volumes of data, specifically corner cases (users who don't follow a statistically significant path), when trying to sort out a UI/UX redesign.

Traffic and transit analysts might want terabytes of data (especially with date and weather indicators) when considering what kind of freeway interchange to add to a growing area.

I suppose I could go on...