by jcgrillo 795 days ago
Another facet of this is how do we store telemetry data? Fully indexed instantaneously searchable seems to be the "default" these days but who actually needs that?

I keep harping on this, but compressed utf-8 text (or even worse, compressed json) is a horribly wasteful way to do it. See [1]. Putting a small amount of thought into storing telemetry data seems like it could yield incredible savings at scale.

[1] https://lists.w3.org/Archives/Public/www-logging/1996May/000...

I was gonna make a post but this took the words out of my mouth. I have a whole talk about this exact topic, but the summary is that the paradigm of hot storage and then 2 weeks later, compressed archive, is the most wasteful way we could possibly organize this data. I discuss this at length in the talk below:

https://www.youtube.com/watch?v=XXgBJmqv0ok

Nice talk. The first (and best!) logs search solution I experienced in my career was simply a gigantic tree of compressed logs on a hadoop cluster. As someone who spent a bunch of time analyzing logs, the "query interface" being "anything you can sling at the hadoop cluster" was phenomenally awesome. The basic computering tools are programming languages, and eventually you encounter problems where you need a real (Turing-complete) one.

One great side effect of this was service developers weren't afraid to write logs. We logged excessively, and it didn't cost too much. If we'd been indexing everything in ES it would have bankrupted us.

These days with S3 and the cloud, hadoop (or the EMR suite) per se probably isn't the way to go, but I'd sure like to see observability solutions giving me a first-class programming model that I as a user can interact with--not some bespoke "query DSL", and for them to accept that instantaneous indexed retrieval isn't important.

This paper is really interesting: https://www.usenix.org/system/files/osdi21-rodrigues.pdf

Stuff like this gives me hope we can have it both ways. With highly tuned compression and programmatic access the user is empowered and the cost is minimized.

I thought compressed JSON was pretty efficient. How much would you expect to save over that with a custom binary format?
Storing data in compressed JSON consists of:

- converting every number into its sequence of digits in decimal notation,

- writing those one character at a time,

- also writing the string representation of each value's label, repeated for every record,

- compressing all this with a structure-unaware generic text compression algorithm based on longest match search.

Each time you want to read that data, undo all of the above in reverse order.

You can optimize to some degree, but that's basically it.

I expect that not doing any of this saves the time spent doing it. I also expect data type aware compression to be much more efficient than text compressing the text expansion.

In numbers, I expect 2 to 3 orders of magnitude difference in time and also in space (for non random data).
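
To make the comparison concrete, here is a minimal sketch of the two paths described above, run on made-up data. The record layout, the field names `timestamp` and `value`, and the fixed-width integer packing are all assumptions for illustration. It only measures size on one synthetic series and does not attempt to reproduce the orders-of-magnitude estimate.

```python
import json
import struct
import zlib
from itertools import accumulate

# Hypothetical telemetry: one sample per second, small integer readings.
records = [
    {"timestamp": 1_700_000_000 + i, "value": (i * 7) % 100}
    for i in range(10_000)
]

def encode_json(recs):
    """Text path: decimal digits plus repeated labels, then generic compression."""
    text = "\n".join(json.dumps(r) for r in recs)
    return zlib.compress(text.encode("utf-8"), 9)

def encode_binary(recs):
    """Type-aware path: delta-encode timestamps, store fields column by column.

    Assumes a non-empty list of integer fields that fit in 64 bits.
    """
    ts = [r["timestamp"] for r in recs]
    deltas = [ts[0]] + [b - a for a, b in zip(ts, ts[1:])]
    values = [r["value"] for r in recs]
    n = len(recs)
    raw = struct.pack(f"<I{n}q{n}q", n, *deltas, *values)
    return zlib.compress(raw, 9)

def decode_binary(blob):
    """Invert encode_binary: unpack both columns and re-sum the deltas."""
    raw = zlib.decompress(blob)
    (n,) = struct.unpack_from("<I", raw, 0)
    fields = struct.unpack_from(f"<{n}q{n}q", raw, 4)
    timestamps = accumulate(fields[:n])
    return [{"timestamp": t, "value": v} for t, v in zip(timestamps, fields[n:])]

json_size = len(encode_json(records))
binary_size = len(encode_binary(records))
print(f"compressed JSON: {json_size} bytes, delta-encoded binary: {binary_size} bytes")
```

Real telemetry is noisier than this series, so the gap will differ. The point is only that the binary path never produces the digit strings and labels in the first place.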

The network difference between compressed JSON and a compressed binary format is likely negligible.

But jcgrillo was talking about storage (at least his link was). And when parsing for analysis or for storing millions of points daily, there's no doubt that a binary format is simply a lot more CPU and disk efficient.

Usually the JSON gets transformed into a binary format (example: BSON).
The thing about telemetry data is it's extremely repetitive. Take for example a CLF[1] log line:

  127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
As written this is 99 bytes (792 bits), but how much information is actually in it? We have an IP address which is taking up 9 bytes but only needs at most 4 (fewer in cases like this where two of the bytes are zero if we employ varint encoding). Across log lines the ident and user will likely be very repetitive, so storing each unique occurrence more than once is really wasteful. The timestamp takes up 28 bytes but only needs 13 bytes--far fewer if that field is delta encoded between log lines. The HTTP method is taking up 5+ bytes, it's only worth 1 byte. The URLs are also super redundant--no need to store a copy in each line. The HTTP version is 1 byte but it's taking up 8. The status code is taking up 3 bytes but it's only worth 1--there are only 63 "real" HTTP status codes. The content length is taking up 4 bytes when it needs only 2. So I guess this log line only really has ~33 bytes of information in it (assuming a 32 bit pointer for each string--ident, user, URL). Much less if amortized across many lines. So maybe by naively parsing this log line and throwing a bunch of them in columnar, packed protobuf fields (where we get varint encoding for free), and delta-encoding the timestamps, and maintaining a dictionary for all the strings, we might achieve something like a ~5x compression ratio.
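
A rough sketch of that back-of-the-envelope encoding, assuming IPv4 hosts and well-formed CLF lines. The `ClfEncoder` class, the `METHODS` and `VERSIONS` tables, and the hand-rolled varint are illustrative stand-ins, not protobuf itself, and the decoder and the on-disk layout of the string dictionary are left out.

```python
import ipaddress
import re
from datetime import datetime

CLF = re.compile(
    r'(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+|-)'
)
# Hypothetical lookup tables: one byte each for method and HTTP version.
METHODS = ["GET", "HEAD", "POST", "PUT", "DELETE", "OPTIONS", "PATCH"]
VERSIONS = ["HTTP/1.0", "HTTP/1.1", "HTTP/2.0"]

def varint(n):
    """Unsigned varint: 7 payload bits per byte, high bit means 'more follows'."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def zigzag(n):
    """Map signed to unsigned so small negative deltas stay small."""
    return n * 2 if n >= 0 else -n * 2 - 1

class ClfEncoder:
    def __init__(self):
        self.strings = {}    # shared dictionary: string -> small integer id
        self.last_epoch = 0  # previous timestamp, for delta encoding

    def intern(self, s):
        return self.strings.setdefault(s, len(self.strings))

    def encode(self, line):
        host, ident, user, ts, method, url, version, status, size = (
            CLF.match(line).groups()
        )
        epoch = int(datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z").timestamp())
        delta, self.last_epoch = epoch - self.last_epoch, epoch
        return b"".join([
            ipaddress.IPv4Address(host).packed,      # 4 bytes
            varint(self.intern(ident)),
            varint(self.intern(user)),
            varint(zigzag(delta)),                   # small after the first line
            bytes([METHODS.index(method)]),          # 1 byte
            varint(self.intern(url)),
            bytes([VERSIONS.index(version)]),        # 1 byte
            varint(int(status)),
            varint(0 if size == "-" else int(size)),
        ])

encoder = ClfEncoder()
line = ('127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326')
print(len(line), "bytes as text ->", len(encoder.encode(line)), "bytes packed")
```

The dictionary still has to be stored somewhere, so the achievable ratio depends on how often ident, user and URL repeat across lines.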

Playing around with gzip -9 on some test data[2] (not exactly CLF, but maybe similar entropy) I'm getting like ~1.9x compression.

Obviously if I parse this log line into a JSON blob, that blob will compress with a much higher ratio due to the repetitive nature of JSON, but it'll still be larger than the equivalent compressed CLF.

I'm working on a demo for my "protobuf + fst[3]" idea, so I'm not sure if my "maybe ~5x" claim is totally off the mark or not. But I'm confident we can do way better than JSON.

[1] https://en.wikipedia.org/wiki/Common_Log_Format

[2] https://www.sec.gov/about/data/edgar-log-file-data-sets

[3] https://crates.io/crates/fst

EDIT: I guess maybe another way to state my conjecture is "telemetry compression is not general purpose text compression". These data have a schema, and by ignoring that fact and treating them always as schemaless data (employing general purpose text compression methods) we're leaving something on the table.

My hunch is that JSON using a custom compression dictionary with zlib (see zdict argument to https://docs.python.org/3/library/zlib.html#zlib.compressobj) or zstandard would get you most of the benefit while still letting you interact with existing JSON tools. I've not put the work in to prove that to myself though!
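
For anyone who wants to try the zdict idea, a minimal sketch with Python's zlib follows. The dictionary contents and the record shape are made up for illustration. In practice the dictionary would be built from a sample of real records, and the same bytes have to be available to whoever decompresses.

```python
import json
import zlib

# Hypothetical preset dictionary: the keys and punctuation every record shares.
ZDICT = b'{"host": "", "method": "GET", "path": "/", "status": 200, "bytes": '

def compress(record, zdict=None):
    """Compress one JSON record, optionally priming zlib with a preset dictionary."""
    data = json.dumps(record).encode("utf-8")
    if zdict:
        c = zlib.compressobj(level=9, zdict=zdict)
    else:
        c = zlib.compressobj(level=9)
    return c.compress(data) + c.flush()

def decompress(blob, zdict=None):
    """Inverse of compress; must be given the same dictionary."""
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return json.loads(d.decompress(blob) + d.flush())

record = {"host": "127.0.0.1", "method": "GET", "path": "/apache_pb.gif",
          "status": 200, "bytes": 2326}
plain = compress(record)
primed = compress(record, ZDICT)
print(f"without dictionary: {len(plain)} bytes, with dictionary: {len(primed)} bytes")
```

The dictionary helps most when records are compressed one at a time. Over a large batch, zlib discovers the repeated keys on its own and the benefit shrinks.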
Since labels and other predefined constants carry no information, compressing them better is not going to win the argument.

Have a look at the description and performance of a non-toy time series database published 10 years ago:

https://www.vldb.org/pvldb/vol8/p1816-teller.pdf

Convenience of text and json is an argument, but performance??

Yeah that would be an interesting experiment too.

This blog post has some interesting ideas as well: https://www.uber.com/blog/reducing-logging-cost-by-two-order...

This is all quite true, but a possibly faster way to prototype would be to use the DRAIN algorithm (there are Rust and Python impls that are easy to use) to determine the "log template". Then push the log template when it's first seen, and nothing but values after that, into a programmatically generated table in a common columnar format like Parquet or Iceberg. Then you can point any of the myriad data analysis tools like DuckDB, DataFusion, or the latest InfluxDB at it, and you've got your SQL on logs implemented. It can feel a bit Rube-Goldbergish, and it's a bit tricky to navigate the space uninformed because it's early, but it can also handle all the other data your company uses in one platform, with no need to special-case "applications" from the "data analysis"/historical/log side. One place to handle permissions. Then there are tools like Dagster for managing this humongous single database in a straightforward way, rather than writing a web of applications that push and pull to it without a complete picture being possible, or needing devs to remember their place in the system. Search up Uber CLP for prior art, or more generally the "modern data stack" (PRQL will be perfect for querying logs). But by piggybacking on big systems like this, you can take advantage of future advancements in the state of the art, like BtrBlocks https://www.google.com/url?sa=t&source=web&rct=j&opi=8997844.... Of course, if the savings/earnings are high enough, I guess you start on implementing this now.
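
The "template once, values after" idea can be sketched without any dependencies. This is not DRAIN: it is a crude stand-in that treats every run of digits as a variable, and the `TemplateStore` class is hypothetical. It assumes log lines never contain a literal `<*>`. A real prototype would take templates from a DRAIN implementation and write the value columns to Parquet.

```python
import re
from collections import defaultdict

NUMBER = re.compile(r"\d+")

class TemplateStore:
    def __init__(self):
        self.template_ids = {}         # template string -> id, stored once
        self.rows = defaultdict(list)  # template id -> list of value tuples

    def add(self, line):
        """Split a line into its template and its variable values."""
        values = tuple(NUMBER.findall(line))
        template = NUMBER.sub("<*>", line)
        tid = self.template_ids.setdefault(template, len(self.template_ids))
        self.rows[tid].append(values)  # kept as strings to preserve leading zeros
        return tid

    def rebuild(self, tid, row):
        """Reconstruct the original line from template id and row number."""
        template = next(t for t, i in self.template_ids.items() if i == tid)
        values = iter(self.rows[tid][row])
        return re.sub(r"<\*>", lambda _: next(values), template)

store = TemplateStore()
lines = [
    "connected to 10.0.0.1 in 35 ms",
    "connected to 10.0.0.2 in 7 ms",
    "disk 3 at 91 percent",
]
ids = [store.add(line) for line in lines]
print(store.template_ids)
```

Each template's rows all have the same number of values, which is what makes them map naturally onto one table per template in a columnar format.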
Thanks, I wasn't aware of either DRAIN or BtrBlocks. CLP is very cool. Honestly I'm not sure what a good query experience looks like. I really enjoy the flexibility of mapreduce because there are no "unsolvable" problems--if a high level DSL like Hive or Pig gets in the way you just drop down a level to Spark or streaming Python mapreduce or whatever. So ultimately rather than a "DSL for logs" I'd rather have more like a "programming model for logs". I don't know what this looks like in 2024, hopefully not still actually hadoop/EMR.