by statictype 2777 days ago
Nice article.

>its not uncommon for some of our customers to send us millions of metrics every minute

What kind of customers/services generate millions of points a minute?

11 comments

I used to work at a Fortune 50 retailer on the cloud platform (a lot of tooling around CI/CD for the teams that manage the website). We had a large problem with keeping the metrics pipeline current. A major issue is that by default, Spring Boot publishes about 500 different metrics on a 10 second slice. Allowing every application to pump out that many default metrics, most of which are never used, means that it takes only 334 instances to get one million metrics per minute (1,000,000 / 6 / 500 = 333 1/3). I would guess we had a couple thousand instances in production on a normal day. In a couple weeks, they have Black Friday. Any team that hasn't been able to fix performance problems is given the go ahead to just throw money at it and scale horizontally in obscene ways.

Of course, our metrics were all handled in house. From talking to the teams that handled the metrics pipeline, the vendors were great for smaller companies, but there was no off the shelf solution for a company that large with that volume. But I did very little with that myself, other than look into the fact that Spring Boot published way too many default metrics. Who needs P50, P70, P75, P80, P85, P90 - P99 on all web requests?! Just set a default that is small and worthwhile and let the developers adjust as needed.
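
For anyone who wants to check the instance-count arithmetic in the comment above, here is a minimal sketch. The function name and parameters are made up for illustration; the only inputs taken from the comment are roughly 500 metrics per publish and a 10 second publish interval.

```python
import math

def instances_for_target(target_per_minute, metrics_per_publish, publish_interval_s):
    """Smallest whole number of instances whose combined output reaches the target rate."""
    # Each instance publishes 60 / interval times per minute.
    publishes_per_minute = 60 / publish_interval_s
    per_instance_per_minute = metrics_per_publish * publishes_per_minute
    # Round up: a fraction of an instance cannot exist.
    return math.ceil(target_per_minute / per_instance_per_minute)

# 500 metrics every 10 s is 3,000 metrics per minute per instance.
print(instances_for_target(1_000_000, 500, 10))  # prints 334
```

With "a couple thousand" instances, the same formula puts the fleet at several million metrics per minute from defaults alone.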

Why this whole per-minute thing? I can insert 1,000,000 per minute on my MBP using reasonable batching (it's only 16K per second).
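
A rough sketch of what "reasonable batching" might look like. The commenter does not say which database they used, so SQLite, the `samples` table, and the `insert_batched` helper here are all assumptions for illustration, and no throughput figure is implied.

```python
import sqlite3
from itertools import islice

def insert_batched(conn, samples, batch_size=10_000):
    """Insert (ts, name, value) tuples in fixed-size batches; returns rows inserted."""
    it = iter(samples)
    total = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        # One transaction per batch instead of one per row.
        with conn:
            conn.executemany(
                "INSERT INTO samples (ts, name, value) VALUES (?, ?, ?)", batch
            )
        total += len(batch)
    return total

# Hypothetical schema and synthetic data, just to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (ts INTEGER, name TEXT, value REAL)")
n = insert_batched(conn, ((i, "cpu.load", i * 0.5) for i in range(50_000)))
```

The point of the design is that the per-commit overhead is paid once per batch rather than once per sample, which is usually what makes a rate like 16K inserts per second unremarkable.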
Folks need to resist the inclination to just gather maximum data for the hell of it.

If you're pumping out a million metrics per minute, almost none of those are ever going to actually be used to generate meaningful insight.

I used to work at a startup that made physical robots. The robot generated several GBs of data every time it turned on. You're correct, most of that data wasn't looked at most of the time. But every now and then, someone would say "Hey, I saw a robot do something funny the other day, what the hell happened?" And having all that data usually made it possible to figure out what happened. To me, "maximum data for the hell of it" isn't about generating insight by looking at trends, it's about generating insight during incident analysis.
> To me, "maximum data for the hell of it" isn't about generating insight by looking at trends, it's about generating insight during incident analysis.

Agree 100%.

That is a very particular use case, where very high res data is critical. I note that even here, you're interested in data from "the other day", not years ago.

In most cases, time spent maintaining terabytes of rapidly aging time series data would be better spent elsewhere.

I think that really depends on the case.

A particularly good high-frequency trader might be interested in terabytes of minutiae when they're trying to sort out what caused yesterday's spike and crash of ticker XYZ.

Systems and sales analysts that are looking at web store front ends (and back ends, if there are issues) would be interested in large volumes of data, specifically corner cases (users who don't follow a statistically significant path), when trying to sort out a UI/UX redesign.

Traffic and transit analysts might want terabytes of data (especially with date and weather indicators) when considering what kind of freeway interchange to add to a growing area.

I suppose I could go on...

I used to work at Meraki. There, we would, every 5 minutes or so, record byte and packet counts classified by traffic class and remote end of the connection for every single client connected to an SMB internal or customer-facing network.

(Meraki actually did implement its own time-series database, and after I left published a paper describing its design and implementation: https://meraki.cisco.com/lib/pdf/trust/lt-paper.pdf. Good quote on the motivation: "As discussed in Section 2.3.3, customers have a nearly insatiable demand for high-resolution historical data, even though they mostly query data from the recent past.")

Any more good links from them?
Not that I can find - it's a proprietary solution for internal use only.
Some companies record as much as they can, e.g. "here's all of the user's mouse-movements, their full User Agent, etc." or "here's everywhere the user tapped on the app and which of the notifications we sent them that they responded to, and more!"

The example from the article was "one team at one of our customers decided to dump 30 million metrics on us, send all of their mobile product metrics into Outlyer"

Machine-generated event telemetry from mobile phones, cars, etc. can easily be tens of millions per second in real-world applications. Human-generated event telemetry (e.g. text messaging) peaks at hundreds of thousands per second if you are working on global scales.

There is a virtually unlimited number of applications that could generate 16k events per second (a million per minute).

At one employer most customers would opt in to upload their system logs, and the company would analyze them to anticipate problems for preventive maintenance.

The result was a Niagara of data: tens of terabytes a day.

When I left they were in the early stages of adopting Hadoop because ordinary parse/analyze was starting to fall behind.

16k samples/s is not a lot. There are many Prometheus users with hundreds of thousands of samples/s on a single Prometheus server.

Across their organisations it can be much more, Fastly has reported 2.2M/s (https://promcon.io/2018-munich/slides/monitoring-at-scale-mi...) for example.

Often per-click stuff ends up with dozens or hundreds of data points from different parts of the code -- heartbeats, feature usage, funnels, experiment entry, etc.
1 mil is only 16K per second.
We have customers that generate tens of millions of measurements per second. Lots of low-level systems latencies can be collected at high volume. Also, high volume online services can easily generate this order of magnitude.
Transportation companies--truck fleet data, airplanes, etc.
Customers using a large-scale analytics database...