Hacker News new | ask | show | jobs
by Veserv 1354 days ago
If you have exactly two logging operations in your entire program:

log(“Began {x} connected to {y}”)

log(“Ended {x} connected to {y}”)

We can label the first one logging operation 1 and the second one logging operation 2.

Then, if logging operation 1 occurs we can write out:

1, {x}, {y} instead of “Began {x} connected to {y}” because we can reconstruct the message as long as we know what operation occurred, 1, and the value of all the variables in the message. This general strategy can be extended to any number of logging operations by just giving them all a unique ID.

That is basically the source of their entire improvement. The only other thing that may cause a non-trivial improvement is that they delta encode their timestamps instead of writing out what looks to be a 23 character timestamp string.

The columnar storage of data and dictionary deduplication, what is called Phase 2 in the article, is still not fully implemented according to the article authors and is only expected to result in a 2x improvement. In contrast, the elements I mentioned previously, Phase 1, were responsible for a 169x(!) improvement in storage density.

1 comments

So does it means you need to specify rules for all unique logging operations you have? I was under impression that the thing can do it automatically... In a runtime based on receiving repetitive logs or something like that. If it doesn't, it is still a hell of work to compile and maintain the list of rules for your applications.