| HN Mirror

by jiggawatts 564 days ago

Your feelings are spot on.

In most modern distributed tracing, "observability", or similar systems the write amplification is typically 100:1 because of these overheads.

For example, in Azure, every log entry includes a bunch of highly repetitive fields in full, such as the resource ID, "Azure" as the source system, the log entry Type, the tenant, etc.

A single "line" is typically over a kilobyte, but often the interesting part is maybe 4 to 20 bytes of actual payload data. Sending this also involves HTTP overheads such as headers and authentication.
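As a rough sanity check on those numbers, here is the arithmetic as a tiny sketch. It assumes a line of exactly 1024 bytes (the comment says "over a kilobyte", so treat the results as lower bounds) and ignores the HTTP overhead entirely; `amplification` is just an illustrative helper name.

```python
def amplification(line_bytes, payload_bytes):
    # Bytes sent per byte of interesting payload.
    return line_bytes / payload_bytes

LINE_BYTES = 1024  # assumed size of one log "line"

for payload_bytes in (4, 20):
    ratio = amplification(LINE_BYTES, payload_bytes)
    print(f"{payload_bytes:>2} byte payload -> {ratio:.1f}:1")
```

That gives roughly 256:1 for a 4 byte payload and about 51:1 for a 20 byte payload, which brackets the 100:1 figure.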

Most vendors in this space charge by the gigabyte, so as you can imagine they have zero incentive to improve on this.

Even for efficient binary logs such as the Windows performance counters, I noticed that second-to-second they're very highly redundant.

I once experimented with a metric monitor that could collect 10,000-15,000 metrics per server per second and use only about 100MB of storage per host... per year.

The trick was to simply binary-diff the collected metrics with some light "alignment" so that groups of related metrics would be at the same offsets. Almost all numbers become zero, and compress very well.
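A minimal sketch of that idea, not the original tool: it assumes every metric is stored as a fixed-width 64-bit integer so that each one stays at the same byte offset (the "alignment"), uses XOR as the binary diff, and uses zlib as a stand-in compressor. The synthetic data, where 5% of the counters change each second, and all function names are made up for illustration.

```python
import random
import struct
import zlib

def pack_sample(values):
    # Fixed-width little-endian 64-bit fields keep every metric at a stable byte offset.
    return struct.pack(f"<{len(values)}Q", *values)

def xor_bytes(a, b):
    # XOR two equal-length byte strings; unchanged fields become all-zero bytes.
    n = len(a)
    return (int.from_bytes(a, "little") ^ int.from_bytes(b, "little")).to_bytes(n, "little")

def simulate(num_metrics=10_000, num_samples=30, seed=0):
    # Synthetic counters: 5% of them tick up between consecutive samples.
    rng = random.Random(seed)
    values = [rng.randrange(1_000_000) for _ in range(num_metrics)]
    samples = []
    for _ in range(num_samples):
        for i in rng.sample(range(num_metrics), num_metrics // 20):
            values[i] += rng.randrange(1, 10)
        samples.append(pack_sample(values))
    return samples

def diff_encode(samples):
    # Keep the first sample verbatim, then store only the XOR against the previous one.
    return [samples[0]] + [xor_bytes(p, c) for p, c in zip(samples, samples[1:])]

def diff_decode(diffs):
    # Undo the diff by XOR-ing each delta onto the previously restored sample.
    out = [diffs[0]]
    for d in diffs[1:]:
        out.append(xor_bytes(out[-1], d))
    return out

samples = simulate()
diffs = diff_encode(samples)
raw_size = len(zlib.compress(b"".join(samples)))
diff_size = len(zlib.compress(b"".join(diffs)))
print(f"compressed raw: {raw_size} bytes, compressed diffs: {diff_size} bytes")
```

With 10,000 metrics each sample is 80,000 bytes, which is larger than the 32 KiB window deflate can look back over, so the compressor cannot find the second-to-second redundancy on its own. Diffing first turns it into long runs of zeros that any compressor handles well.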

kiitos 564 days ago

You never send a single log event per HTTP request; you always batch them up. Assuming a reasonable batch size per request (minimum ~1MiB or so), there is rarely any meaningful difference in payload size between gzipped/zstd/whatever JSON bytes and any particular binary encoding format you might prefer.

jiggawatts 563 days ago

Most log collection systems do not compress logs as they send them, because again, why would they? This would instantly turn their firehose of revenue cash down to a trickle. Any engineer suggesting such a feature would be disciplined at best, fired at worst. Even if their boss is naive to the business realities and approves the idea, it turns out that it's weirdly difficult in HTTP to send compressed requests. See: https://medium.com/@abhinav.ittekot/why-http-request-compres...

HTTP/2 would also improve efficiency because of its built-in header compression feature, but again, I've not seen this used much.

The ideal would be to have some sort of "session" cookie associated with a bag of constants, slowly changing values, and the schema for the source tables. Send this once a day or so, and then send only the cookie followed by columnar data compressed with RLE and then zstd. Ideally in a format where the server doesn't have to apply any processing to store the data apart from some light verification and appending onto existing blobs. I.e.: make the whole thing compatible with Parquet, Avro, or something other than just sending uncompressed JSON like a savage.
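A rough sketch of what such a scheme could look like, under several assumptions: the session registry is a plain in-memory dict, the wire format is JSON purely to keep the example short (a real version would use something Parquet- or Avro-compatible), and zlib stands in for zstd because it ships with Python. The session ID, field names, and constants are all hypothetical.

```python
import json
import zlib

def rle_encode(values):
    # Run-length encode a column as [value, run_length] pairs.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

def encode_batch(session_id, rows, schema):
    # Pivot rows into columns, RLE each column, then compress the whole batch.
    columns = {name: rle_encode([row[name] for row in rows]) for name in schema}
    body = {"session": session_id, "columns": columns}
    return zlib.compress(json.dumps(body, separators=(",", ":")).encode())

def decode_batch(blob, sessions):
    # Look up the constants by session ID and re-attach them to every row.
    body = json.loads(zlib.decompress(blob))
    constants = sessions[body["session"]]["constants"]
    columns = {name: rle_decode(runs) for name, runs in body["columns"].items()}
    count = len(next(iter(columns.values())))
    return [{**constants, **{name: col[i] for name, col in columns.items()}}
            for i in range(count)]

# Sent once (say, daily): the constants and the schema, keyed by a session ID.
sessions = {
    "s1": {
        "constants": {"source": "Azure", "tenant": "t-123", "resource": "vm-42"},
        "schema": ["level", "status", "latency_ms"],
    }
}

rows = [{"level": "INFO", "status": 200, "latency_ms": 20 + i // 50} for i in range(1000)]
blob = encode_batch("s1", rows, sessions["s1"]["schema"])

# What the same batch costs when every row repeats every field as plain JSON.
full_rows = [{**sessions["s1"]["constants"], **row} for row in rows]
naive = json.dumps(full_rows).encode()
print(f"naive JSON: {len(naive)} bytes, session + columnar + RLE: {len(blob)} bytes")
```

The receiving side only has to look up the session and append the columns. The sketch decodes back into rows only to show that nothing was lost.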

kiitos 563 days ago

Most systems _do_ compress request payloads on the wire, because the cost-per-byte in transit over those wires is almost always frictional and externalized.

Weird perspective, yours.

piterrro 563 days ago

They will compress over the wire, but then decompress on ingest and bill for the uncompressed size. After that, an interesting thing happens: they compress the data again, along with other techniques, to minimize its size on their premises. Can't blame them... they're just trying to cut costs, but the fact that they charge so much for something so easily compressible is just... not fair.

jiggawatts 563 days ago

A part of the problem is that the ingestion is not vector compressed, so they're charging you for the CPU overhead of this data rearrangement.

It would cut costs a lot if the source agents did this (pre)processing locally before sending it down the wire.

piterrro 562 days ago

We should distinguish between compression in transit and at rest. Compressing a larger corpus should yield better results than compressing smaller chunks, because dictionaries can be reused (zstd, for example).
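To put rough numbers on the chunk-size effect, here is a small sketch. It uses zlib's preset dictionary (`zdict`) as a stand-in for a zstd dictionary, since zlib ships with Python, and it simply uses the first log line as the dictionary instead of training one. The log lines are synthetic.

```python
import zlib

# Synthetic, highly repetitive JSON log lines (field names are made up).
lines = [
    (
        f'{{"ts":{1700000000 + i},"level":"INFO","service":"checkout",'
        f'"msg":"request served","status":200,"latency_ms":{20 + i % 13}}}'
    ).encode()
    for i in range(500)
]

def compress_with_dict(data, zdict):
    # Compress one chunk against a preset dictionary shared by both sides.
    compressor = zlib.compressobj(zdict=zdict)
    return compressor.compress(data) + compressor.flush()

def decompress_with_dict(blob, zdict):
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()

shared_dict = lines[0]  # assumption: a real setup would train a proper dictionary

separately = sum(len(zlib.compress(line)) for line in lines)
with_dict = sum(len(compress_with_dict(line, shared_dict)) for line in lines)
together = len(zlib.compress(b"\n".join(lines)))

print(f"each line on its own:      {separately} bytes")
print(f"each line, shared dict:    {with_dict} bytes")
print(f"whole corpus in one piece: {together} bytes")
```

Compressing each line alone is the worst case, a shared dictionary recovers much of the loss, and compressing the whole corpus at once does best, which is the in-transit (small batches) versus at-rest (large blocks) gap.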

david38 563 days ago

This is why metrics rule, and logging in production need only be turned on to debug specific problems, and even then with a short TTL.

jiggawatts 563 days ago

You got... entirely the wrong message.

The answer to "this thing is horrendously inefficient because of misaligned incentives" isn't to be frugal with the thing, but to make it efficient, ideally by aligning incentives.

Open source monitoring software will eventually blow the proprietary products out of the water because when you're running something yourself, the cost per gigabyte is now just your own cost and not a profit centre line item for someone else.

piterrro 563 days ago

Unless you start attaching tags to metrics and allow engineers to explode the cardinality of the metrics. Then your pockets need to be deep.
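A back-of-the-envelope sketch of why: the worst-case number of time series for one metric is the product of the distinct values of each tag. The tag names and counts here are invented, and real systems only store the combinations that actually occur, so this is an upper bound.

```python
from math import prod

def max_series(tag_cardinalities):
    # Upper bound on time series for one metric: product of distinct values per tag.
    return prod(tag_cardinalities.values())

base_tags = {"host": 50, "endpoint": 20, "status": 5}
print(max_series(base_tags))  # 5000

# One high-cardinality tag multiplies everything that came before it.
with_user_id = {**base_tags, "user_id": 10_000}
print(max_series(with_user_id))  # 50000000
```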
