by jcgrillo 798 days ago
The thing about telemetry data is it's extremely repetitive. Take for example a CLF[1] log line:

  127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
As written this is 99 bytes (792 bits), but how much information is actually in it? We have an IP address which is taking up 9 bytes but only needs at most 4 (fewer in cases like this where two of the bytes are zero if we employ varint encoding). Across log lines the ident and user will likely be very repetitive, so storing each unique occurrence more than once is really wasteful. The timestamp takes up 28 bytes but only needs 13 bytes--far fewer if that field is delta encoded between log lines. The HTTP method is taking up 5+ bytes but it's only worth 1 byte. The URLs are also super redundant--no need to store a copy in each line. The HTTP version is worth 1 byte but it's taking up 8. The status code is taking up 3 bytes but it's only worth 1--there are only 63 "real" HTTP status codes. The content length is taking up 4 bytes when it needs only 2. So I guess this log line only really has ~33 bytes of information in it (assuming a 32-bit pointer for each string--ident, user, URL). Much less if amortized across many lines. So maybe by naively parsing this log line and throwing a bunch of them in columnar, packed protobuf fields (where we get varint encoding for free), and delta-encoding the timestamps, and maintaining a dictionary for all the strings, we might achieve something like a ~5x compression ratio.
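
To make the idea concrete, here's a rough sketch of the columnar encoding described above: dictionary IDs for the strings, delta-encoded timestamps, and varints for the integers. It's plain Python rather than actual protobuf, and the names (`CLF`, `varint`, `zigzag`, `encode`) are made up for illustration. It assumes well-formed CLF lines with IPv4 hosts and a numeric size field.

```python
import re
from datetime import datetime

# Hypothetical CLF parser; assumes well-formed lines, IPv4 hosts, numeric size.
CLF = re.compile(r'(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)')


def varint(n):
    """Unsigned base-128 varint, the same scheme protobuf uses."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)
    return bytes(out)


def zigzag(n):
    """Map signed deltas to unsigned ints so small negatives stay small."""
    return 2 * n if n >= 0 else -2 * n - 1


def encode(lines):
    """Return (columns, dictionary) for an iterable of CLF lines."""
    dictionary = {}  # string -> small integer id, shared by all string columns

    def intern(s):
        return dictionary.setdefault(s, len(dictionary))

    names = ("ip", "ident", "user", "ts", "method", "url", "version", "status", "size")
    cols = {name: bytearray() for name in names}
    prev_ts = 0
    for line in lines:
        ip, ident, user, ts, method, url, version, status, size = CLF.match(line).groups()
        # IPv4 address as 4 raw bytes.
        cols["ip"] += bytes(int(part) for part in ip.split("."))
        # Repetitive strings are stored once; each row keeps only a varint id.
        for name, value in (("ident", ident), ("user", user), ("method", method),
                            ("url", url), ("version", version), ("status", status)):
            cols[name] += varint(intern(value))
        # Timestamps are delta-encoded against the previous row.
        epoch = int(datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z").timestamp())
        cols["ts"] += varint(zigzag(epoch - prev_ts))
        prev_ts = epoch
        cols["size"] += varint(int(size))
    return cols, dictionary
```

Summing `len(col)` over the columns gives the per-row payload, but the dictionary itself still has to be stored somewhere, so this alone doesn't measure a real compression ratio.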

Playing around with gzip -9 on some test data[2] (not exactly CLF, but maybe similar entropy) I'm getting like ~1.9x compression.

Obviously if I parse this log line into a JSON blob, that blob will compress with a much higher ratio due to the repetitive nature of JSON, but it'll still be larger than the equivalent compressed CLF.

I'm working on a demo for my "protobuf + fst[3]" idea, so I'm not sure if my "maybe ~5x" claim is totally off the mark or not. But I'm confident we can do way better than JSON.

[1] https://en.wikipedia.org/wiki/Common_Log_Format
[2] https://www.sec.gov/about/data/edgar-log-file-data-sets
[3] https://crates.io/crates/fst

EDIT: I guess maybe another way to state my conjecture is "telemetry compression is not general purpose text compression". These data have a schema, and by ignoring that fact and treating them always as schemaless data (employing general purpose text compression methods) we're leaving something on the table.

2 comments

My hunch is that JSON using a custom compression dictionary with zlib (see zdict argument to https://docs.python.org/3/library/zlib.html#zlib.compressobj) or zstandard would get you most of the benefit while still letting you interact with existing JSON tools. I've not put the work in to prove that to myself though!
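
For anyone who wants to try it, a minimal sketch of the zdict approach using Python's standard `zlib` module. The helper names and the sample records are made up for illustration. In practice the dictionary would be built from a sample of real log records, and the same bytes have to be available at decompression time.

```python
import json
import zlib


def compress_with_dict(payload: bytes, zdict: bytes) -> bytes:
    """Deflate payload with a preset dictionary (zlib's zdict argument)."""
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(payload) + c.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Inverse of compress_with_dict; needs the exact same dictionary bytes."""
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()


# Hypothetical dictionary: one representative record with typical keys and values.
ZDICT = json.dumps({
    "host": "10.0.0.1", "ident": "user-identifier", "user": "alice",
    "ts": "10/Oct/2000:13:55:00 -0700", "method": "GET",
    "url": "/index.html", "version": "HTTP/1.0", "status": 200, "size": 1024,
}).encode()

record = json.dumps({
    "host": "127.0.0.1", "ident": "user-identifier", "user": "frank",
    "ts": "10/Oct/2000:13:55:36 -0700", "method": "GET",
    "url": "/apache_pb.gif", "version": "HTTP/1.0", "status": 200, "size": 2326,
}).encode()

with_dict = compress_with_dict(record, ZDICT)
without_dict = zlib.compress(record, 9)
print(len(record), len(without_dict), len(with_dict))
```

The dictionary mostly helps when records are compressed individually or in small batches. On a large file the compressor picks up the repeated keys on its own after the first few records.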
If labels or other predefined constants are useless anyway, compressing them better is not going to win the argument.

Have a look at the description and performance of a non-toy time series database published 10 years ago:

https://www.vldb.org/pvldb/vol8/p1816-teller.pdf

Convenience of text and JSON is an argument, but performance??

Yeah that would be an interesting experiment too.

This blog post has some interesting ideas as well: https://www.uber.com/blog/reducing-logging-cost-by-two-order...

This is all quite true, but a possibly faster way to prototype would be to use the DRAIN algorithm (there are Rust and Python impls that are easy to use) to determine the "log template". Then push the log template when it's first seen and nothing but values after that, into a programmatically generated table in a common columnar format like Parquet or Iceberg. Then you can point the myriad of data analysis tools like DuckDB, DataFusion, or the latest InfluxDB at it, and you've got your SQL on logs implemented. It can feel a bit Rube-Goldbergish, and it's a bit tricky to navigate the space uninformed because it's early, but it can also handle all other data your company uses in one platform, no need to special case "applications" from the "data analysis"/historical/log side. One place to handle permissions. Then there are tools like Dagster for managing this humongous single database in a straightforward way, rather than writing a web of applications that push and pull to it without a complete picture being possible, or needing devs to remember their place in the system. Search up Uber CLP for prior art, or more generally the "modern data stack" (PRQL will be perfect for querying logs). But by piggybacking on big systems like this, you can take advantage of future advancements in the state of the art, like BtrBlocks https://www.google.com/url?sa=t&source=web&rct=j&opi=8997844.... Of course, if the savings/earnings are high enough, I guess you start on implementing this now.
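
To illustrate the "template once, values after" idea, here's a toy sketch. To be clear, this is not DRAIN (which builds a parse tree and clusters similar messages). It just masks any whitespace-separated token containing a digit, and the function names are made up. Each template then maps to a list of value rows, which is roughly the shape you'd write out as one columnar table per template.

```python
import re
from collections import defaultdict

# Naive assumption: a token containing a digit is a variable, everything else is template.
HAS_DIGIT = re.compile(r"\d")


def split_template(line):
    """Return (template, values) for one log line."""
    template, values = [], []
    for token in line.split():
        if HAS_DIGIT.search(token):
            template.append("<*>")
            values.append(token)
        else:
            template.append(token)
    return " ".join(template), values


def to_tables(lines):
    """Group value rows by template: one 'table' per distinct template."""
    tables = defaultdict(list)
    for line in lines:
        template, values = split_template(line)
        tables[template].append(values)
    return dict(tables)


logs = [
    "connected to 10.0.0.1 in 35 ms",
    "connected to 10.0.0.2 in 7 ms",
    "cache miss for key user:42",
]
for template, rows in to_tables(logs).items():
    print(template, rows)
```

A real implementation would also type the columns (ints, IPs, timestamps) before writing them out, which is where the columnar encodings start to pay off.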
Thanks, I wasn't aware of either DRAIN or BtrBlocks. CLP is very cool. Honestly I'm not sure what a good query experience looks like. I really enjoy the flexibility of mapreduce because there are no "unsolvable" problems--if a high level DSL like Hive or Pig gets in the way you just drop down a level to Spark or streaming Python mapreduce or whatever. So ultimately rather than a "DSL for logs" I'd rather have more like a "programming model for logs". I don't know what this looks like in 2024, hopefully not still actually Hadoop/EMR.