by jiggawatts 1412 days ago
Windows has something like 15,000 performance counters and error metrics that can be collected. There isn’t a monitoring system on earth that can come even close to collecting all of them. At scale, I have to pick and choose maybe 20-100 counters for fear of overloading a cluster(!) of servers collecting the data… once a minute.

That’s because the protocol overheads cause “write multiplication” of a hundred-to-one or worse. Every byte of metric ends up nearly a kilobyte on the wire.
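
To make the overhead concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the endpoint path, the collector host name, the JSON field names and the 200-character bearer token are all made up, and it ignores TLS and TCP framing entirely, which would only make the ratio worse.

```python
import json
import struct

def rest_message_size(host: str, metric: str, value: int, timestamp: int) -> int:
    """Bytes for one hypothetical HTTP POST carrying a single metric sample."""
    body = json.dumps(
        {"host": host, "metric": metric, "value": value, "timestamp": timestamp}
    ).encode("utf-8")
    # Hypothetical request headers; real agents often send many more.
    headers = (
        "POST /api/v1/metrics HTTP/1.1\r\n"
        "Host: collector.example.com\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Authorization: Bearer " + "x" * 200 + "\r\n"
        "\r\n"
    ).encode("ascii")
    return len(headers) + len(body)

# The actual information content: one 32-bit counter value.
raw_bytes = len(struct.pack("<I", 42))

wire_bytes = rest_message_size(
    host="web-01",
    metric=r"\Processor(_Total)\% Processor Time",
    value=42,
    timestamp=1700000000,
)

print(f"payload: {raw_bytes} B, on the wire: {wire_bytes} B, "
      f"ratio: {wire_bytes / raw_bytes:.0f}x")
```

Batching several samples per request helps, but the metric name, host name and timestamp still get repeated as text for every sample.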

Meanwhile I did some experiments that showed that even with a tiny bit of crude data-oriented design and delta compression a single box could collect 10K metrics across 10K endpoints every second without breaking a sweat.
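
For flavour, a minimal sketch of the kind of thing I mean. This is not the code from those experiments, just one common way to do it: keep each endpoint's counters as a flat array in a fixed order agreed on up front (so no names go over the wire), and send only the change since the previous sample as zigzag varints. The function names here are made up for the example.

```python
def zigzag(n: int) -> int:
    """Map signed deltas to unsigned ints so small negatives stay small."""
    return (n << 1) ^ (n >> 63)

def unzigzag(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

def put_varint(out: bytearray, z: int) -> None:
    """Append z as a little-endian base-128 varint (7 bits per byte)."""
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)

def get_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Read one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7

def encode_frame(prev: list[int], curr: list[int]) -> bytes:
    """Encode curr as deltas against prev; both use the same counter order."""
    out = bytearray()
    for p, c in zip(prev, curr):
        put_varint(out, zigzag(c - p))
    return bytes(out)

def decode_frame(prev: list[int], frame: bytes) -> list[int]:
    """Rebuild the current sample from the previous one plus a delta frame."""
    curr = []
    pos = 0
    for p in prev:
        z, pos = get_varint(frame, pos)
        curr.append(p + unzigzag(z))
    return curr

prev = [1000, 5000, 123456, 0]
curr = [1003, 4998, 123456, 70000]
frame = encode_frame(prev, curr)
assert decode_frame(prev, frame) == curr
print(f"{len(curr)} counters in {len(frame)} bytes")
```

Most counters barely move between samples, so most deltas fit in a single byte, and an unchanged counter costs exactly one byte.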

The modern REST / RPC approach is fine for business apps but is an unmitigated disaster for collecting tiny metrics.

Set your goals higher than collecting a selected subset of 1% of the available metrics 60x less frequently than admins would like…