Hacker News
by corytheboyd 562 days ago
You know what would actually kill for saving on log data? Being forced to use an efficient format. Instead of serializing string literals to text and sending all those bytes on each log, require that log message templates be registered in a schema, and then a byte or two can replace the text part of the message.

Log message template parameters would have to be fully represented in the message; it would be way too much work to register the arbitrary values that get thrown in there.

Next logical step is to serialize to something more efficient than JSON. It should be dead simple: it’s a template followed by N values to sub into the template. It could be a proprietary format, or just use something like protobuf.
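A minimal sketch of the idea, assuming a hypothetical in-process registry called `TEMPLATES` and integer-only parameters (neither comes from any real library). Each event is one byte of template id followed by fixed-width values; timestamps are left out to keep it short.

```python
import struct

# Hypothetical registry: template id -> (printf-style text, struct layout of its values).
# The ids, texts and layouts are made up for illustration.
TEMPLATES = {
    1: ("user %d logged in from port %d", "<IH"),
    2: ("request took %d ms", "<I"),
}


def encode(template_id: int, *values: int) -> bytes:
    """Pack one event as a 1-byte template id followed by fixed-width values."""
    _, layout = TEMPLATES[template_id]
    return struct.pack("<B", template_id) + struct.pack(layout, *values)


def decode(blob: bytes) -> str:
    """Look the template up by its id and substitute the unpacked values."""
    text, layout = TEMPLATES[blob[0]]
    return text % struct.unpack(layout, blob[1:])


event = encode(1, 42, 8080)   # 1 + 4 + 2 = 7 bytes
print(len(event), decode(event))
```

String parameters would need a length prefix or a general-purpose encoding such as protobuf, which is where the "fully represented in the message" cost shows up.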

It’s better than compression, because the data that was being compressed (full text of log message template) is just not even present. You could still see gains from compressing the entire message, in case it has text template values that would benefit from it.

I get it, we lost human readability, which may be too big of a compromise for some, but we accomplish the main goal of “make logs smaller” without losing data (individually timestamped events). Besides, this could be made up for with a really nice log viewer client.

I’m sure this all exists already to some degree and I just look dumb, but look dumb I will.

5 comments

Don’t worry about human readability. When you have an issue with log size, you are already logging more than a human can read.
We log TBs per hour and grep is enough for me to find interesting data quite effectively.

The problem with weird log formats is recreating all the neat stuff you can do with existing tooling, not just being able to open a file in a text editor.

I think this is a really good point. A logging system could theoretically toggle "text" mode on and off, giving human readable logs in development and small scale deployments.

In fact, I'm going to build a toy one in python!

> In fact, I'm going to build a toy one in python!

I suggest building it as a normal python logging handler instead of totally custom, that way you don't need a "text" toggle and it can be used without changing any existing standard python logging code. Only requires one tweak to the idea: Rather than a template table at the start of the file, have two types of log entries and add each template the first time it's used.

Drawback is having to parse the whole file to find all the templates, but you could also do something like putting the templates into a separate file to avoid that...
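Roughly what that could look like, as a sketch only: `TemplateHandler` and `render` are hypothetical names, entries are JSON lines rather than a compact binary format, and only positional arguments are handled. A "T" entry defines a template the first time it is used; "E" entries refer to it by id.

```python
import io
import json
import logging


class TemplateHandler(logging.Handler):
    """Writes a 'T' entry the first time a template is seen, then 'E' entries by id."""

    def __init__(self, stream):
        super().__init__()
        self.stream = stream
        self.ids = {}  # template text -> integer id assigned on first use

    def emit(self, record):
        template = str(record.msg)
        if template not in self.ids:
            self.ids[template] = len(self.ids)
            self.stream.write(json.dumps(["T", self.ids[template], template]) + "\n")
        args = record.args or ()
        if not isinstance(args, tuple):
            args = (args,)
        entry = ["E", self.ids[template], record.created, list(args)]
        self.stream.write(json.dumps(entry, default=str) + "\n")


def render(lines):
    """Replay the entries in order and yield the human-readable messages."""
    templates = {}
    for line in lines:
        kind, tid, *rest = json.loads(line)
        if kind == "T":
            templates[tid] = rest[0]
        else:
            _created, args = rest
            yield templates[tid] % tuple(args)


buf = io.StringIO()
logger = logging.getLogger("template_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(TemplateHandler(buf))

# Existing logging calls stay unchanged.
logger.info("user %s logged in", "alice")
logger.info("user %s logged in", "bob")

rendered = list(render(buf.getvalue().splitlines()))
print(rendered)
```

The reader has to see every "T" entry before the "E" entries that use it, which is the whole-file parsing drawback mentioned above.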

Agreed. At that point you need specialized tools anyway.

Not really.

I am writing code on my machine, running one query at a time. I can easily view the logs and spend a lot of time looking at them.

I am running test suites, running thousands of queries. It's harder, but I will still view the logs around failures.

Then I am taking the very code, and pushing it to prod. Should my logs be suddenly completely different in this case?

(The right answer of course is to have a "log-to-text formatter", either running in-process or as a separate post-processing step. But it had better produce nice-looking, human-readable logs, or every format will be "Message: %s".)

I once created a library (now bit rotted) that did all the things you suggested plus some: schema, binary representation, changing date times to offsets from the first record's date time, abbreviating common strings like hostnames etc.

There were a bunch of problems/irritants, mostly stemming from the fact that the format became stateful. Every log needed to have a schema (or repository) available. Abbreviations and date offsets meant that the log contained meta information ... for example, assignment of a compact abbreviation to a string in anticipation of using that abbreviation from that point on. This meant that the log could not be arbitrarily lopped off.

And to my chagrin, I found that simply gzipping a json stream made it almost as compact! That's when I figured it wasn't worth it. I'd probably have investigated more if there was CPU or memory bandwidth pressure in that situation (due to creating more data just to compress it).
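A way to check this on your own data, sketched with synthetic events. The template, the row values, and the comma-separated "template id" encoding are all made up for illustration, and repetitive synthetic data compresses unusually well, so the numbers say little about real logs.

```python
import gzip
import json

# Synthetic, highly repetitive events; real logs will behave differently.
TEMPLATE = "user %d logged in from host web-%d"
rows = [(1700000000 + i, i % 50, i % 4) for i in range(10_000)]

# Plain JSON lines with the full message text in every event.
as_json = "\n".join(
    json.dumps({"ts": ts, "msg": TEMPLATE % (user, host)}) for ts, user, host in rows
).encode()

# Hypothetical compact form: template id 0 followed by the raw values.
as_ids = "\n".join(f"0,{ts},{user},{host}" for ts, user, host in rows).encode()

sizes = {
    "json": len(as_json),
    "json+gzip": len(gzip.compress(as_json)),
    "template ids": len(as_ids),
    "template ids+gzip": len(gzip.compress(as_ids)),
}
for name, size in sizes.items():
    print(f"{name:>18}: {size} bytes")
```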

I started developing a tracing/span library that does just this: log messages are "global" (to a system/org) hierarchical "paths" + timestamp + a tagged union. The tagged union method allows you to have zero or more internal parameters that can be injected into a printf (or similar style) format string when printing, but the message itself is only a few bytes.

The benefit of this approach is that it's dramatically easier to index and cheaper to store at any scale.
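A guess at the shape of such an event, not the actual library: the `CATALOG` below, its paths, ids and layouts are hypothetical. The numeric path id doubles as the union tag, selecting both the payload layout and the format string used when printing.

```python
import struct

# Hypothetical catalog: hierarchical path -> (numeric id, printf-style text, payload layout).
CATALOG = {
    "db.pool.exhausted": (1, "connection pool exhausted", ""),
    "http.request.done": (2, "request finished with status %d in %d ms", "HI"),
}
BY_ID = {pid: (path, text, layout) for path, (pid, text, layout) in CATALOG.items()}

HEADER = "<HQ"  # 2-byte path id (the tag), 8-byte timestamp in microseconds


def pack_event(path: str, timestamp_us: int, *params: int) -> bytes:
    """Header plus whatever payload this path's layout calls for (possibly none)."""
    pid, _, layout = CATALOG[path]
    return struct.pack(HEADER + layout, pid, timestamp_us, *params)


def format_event(blob: bytes) -> str:
    """Read the tag, pick the matching layout, and inject the params into the text."""
    pid, timestamp_us = struct.unpack_from(HEADER, blob)
    path, text, layout = BY_ID[pid]
    params = struct.unpack_from("<" + layout, blob, struct.calcsize(HEADER))
    return f"{timestamp_us} {path}: {text % params}"


done = pack_event("http.request.done", 1700000000000000, 200, 35)
print(len(done), format_event(done))
```

Because every event starts with a fixed-width id and timestamp, indexing by path or time range does not require parsing any text.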

One thing I think people don't appreciate about logging efficiency is that it enables you to log and store more, and I think many don't appreciate how much even modest amounts of text logs can bog down systems. You can't read everything, but filters are easy and powerful, and you can't filter something that doesn't exist.

Another thing people won’t appreciate is ANY amount of friction when they “just want to log something real quick”. Which has merit, you’re debugging some garbage, and need to log something out in production because it’s dumb, harmless, quick, and will tell you exactly what you need. That’s why I think you need a sort of fallback as well, for something like this to capture enough mindshare.

How did your solution work out in terms of adoption by others? Was it a large team using it? What did those people say? Really curious!

It doesn't really replace something like print-line debugging, but the type of system that benefits/can use print-line debugging would see no benefit from a structured logging approach either. The systems I'm targeting are producing logs that get fed into multi-petabyte Elasticsearch clusters.

To answer your question: the prototype was never finished, but the concepts were adapted into a production version that is used for structured events in a semi-embedded system at my work.

There are logging libraries that do this. The text template is logged alongside a binary encoding of the arguments. It saves both space and CPU.

Yup, I'm aware. My focus was more on scaling it out to large aggregation systems.

> should be dead simple, it’s a template followed by N values to sub into the template,

CSV without fixed columns would be fine for that.

> require that log message templates be registered in a schema, and then a byte or two can replace the text part of the message.

Pre-registering is annoying to handle, and compression already de-duplicates these very well. Alternatively the logger can track every template logged in this file so far, and assign it an integer on the fly.

> You know what would actually kill for saving on log data? Being forced to use an efficient format.

Logging is unstructured and free-form by design. The observability events that can be expressed with a fixed format are metrics events, which are already serialized in a trivial format. See the line formats from services like StatsD.
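For reference, a StatsD-style line is just `name:value|type`, optionally followed by a sample rate. A small sketch (the helper name `statsd_line` and the metric names are made up for illustration):

```python
def statsd_line(name, value, kind, sample_rate=None):
    """Format one metric as a StatsD-style line: name:value|type[|@rate]."""
    line = f"{name}:{value}|{kind}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line


print(statsd_line("page.views", 1, "c"))             # counter
print(statsd_line("request.time", 320, "ms", 0.1))   # timer, sampled at 10%
```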