by ghshephard 4055 days ago
It's beyond me how he doesn't understand that text logs are a universal format, easily accessible, that can be instantly turned into whatever binary format you desire with a highly efficient insertion process (Splunk is just one of those that does a great job).

Here is the thing he doesn't seem to understand - all of us who are sysadmins absolutely understand the value of placing complex and large log files into a database so that we can query them efficiently. We also understand why having multi-terabyte text log files is not useful.

But what we find totally unacceptable is log files being shoved into binary repositories as the primary storage location. Because, you know what, everyone has their own idea of what that primary storage location should be, and they are mostly incompatible with each other.

The nice thing about text - for the last 40 years it's been universally readable, and will be for the next 40 years. Many of these binary repositories will be unreadable within a short period, and will be immediately unreadable to those people who don't know the magic tool to open them.


> text logs are a universal format

Uh, I don't know what world you live in but I'd like the address because mine sucks in comparison.

Text logs are definitely not a "universal format". Easily accessible, sure. Human readable most of the time? Okay. Universal? Ten times nope.

Give you an example: uwsgi logs don't even have timestamps, and contain whatever crap the program's stdout outputs, so you often end up with three different types of your "universal format" in there. I'm not giving this example because it's contrived, but because I was dealing with it the very moment I read your comment.

But at least you have a fighting chance. What if that exact same data was dumped into a binary file, that you did not know how to decode?

Originally, you had a problem - the data wasn't formatted in a manner that you could parse cleanly.

Now, you have a new problem - not only is the data not formatted properly, it's now in some opaque binary file.

Saying that there are poorly formatted text files isn't a hit against text files, it's a hit against poor formatting. The exact same problem exists if the file is in binary form, and not formatted properly.

> a binary file, that you did not know how to decode

I guess nobody ever advocated putting stuff in a binary file with an undefined format. Databases, syslog-ng, elasticsearch and the systemd journal all have a defined format with plenty of tools to access the data in a more structured way (eg. treating dates as dates and matching on ranges).

I agree the issue at hand is not just binary vs. plain text, it's more "how much you want to structure your data".

The classic syslog format is very loosely defined, with every application defining its own dialect, each with its own way to separate fields and handle escaping. To fix that you could store the log data as JSON, as many online services are doing. But once you have JSON, grep is no longer enough to properly handle the data even if it's still plain text. Now that you have both a quite verbose format on disk and the need for custom tools, why not store the log as binary encoded JSON (e.g. something like JSONB in PostgreSQL)? Or make it even more efficient with a format optimized for the specific usage? Add some indexes and you get more or less what databases, ElasticSearch and the journal do.

Also keep in mind that most logs right now get rotated and compressed with gzip. I doubt that the above binary formats are less resilient to errors than a gzip stream.

Sure, an opaque binary file is pointless.

But that's not what most logging systems that log to binary files offer. They give you specs (example: http://www.freedesktop.org/wiki/Software/systemd/journal-fil...) and tools.

Binary doesn't have to mean closed/opaque.

That's what the grandparent was explaining though. We have near-ubiquitous tools for dealing with plaintext files. Every Linux admin knows them and uses them in many more situations than just log files. They can be scripted and piped, and an admin worth his salt could easily find the info he needs with them.

A binary file from whatever logging system, OTOH, is effectively proprietary. Even if the logging system provides you with tools to work on them, you have to 1) know that it's a log file for that logging system, and 2) be familiar enough with the tools in order to work with it.

And the specs will be gone in 40 years. While ASCII will stick around.

> And the specs will be gone in 40 years. While ASCII will stick around.

Why would they be gone? You realize ASCII is a 'spec' too?

If a binary format has an open specification, it's as future proof as ASCII. ASCII's durability is due to a clear and open specification that's easily implemented. Not some magic sauce that makes it instantly human readable.

That text you see? It's not what's actually in the file. That's just 1's and 0's like every other format. There's literally no difference between ASCII and any other "binary" format.
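
A minimal sketch of that point, assuming Python's bundled `cp037` codec as a stand-in for "EBCDIC" (it is one of several EBCDIC code pages, and the sample message is made up). The same five characters become different bytes under each spec, and only the matching spec turns them back into text:

```python
# The same five characters under two different text encodings.
message = "ERROR"

ascii_bytes = message.encode("ascii")    # bytes as an ASCII log would store them
ebcdic_bytes = message.encode("cp037")   # cp037: one EBCDIC code page shipped with Python

print(ascii_bytes.hex())
print(ebcdic_bytes.hex())

# Decoding with the matching spec recovers the text in both cases.
assert ascii_bytes.decode("ascii") == message
assert ebcdic_bytes.decode("cp037") == message

# Decoding the EBCDIC bytes with a different spec yields something else entirely.
wrong = ebcdic_bytes.decode("latin-1")
print(repr(wrong))
```

Either file is "readable" only to a tool that implements the right table.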

Text encodings have come and gone before, too. We no longer use the Baudot code on modern computers, and EBCDIC is confined to IBM mainframes.

Does that really matter? Log files are often unimportant once they are more than a month or two old. What is it in your log files that has to be kept for 40 years?

Longevity of log files hardly seems like a reason to pick an otherwise inferior format.

It is not about reading 40-year-old logs, but rather about reading logs generated today by a 40-year-old system.

For example, many nuclear power plants in the West were built 40 years ago. Among the myriad sensors and devices in a power plant, I think most are outputting ASCII logs. They are still readable today. (The same can be said about avionics, space probes, etc.)

Now imagine yourself 40 years from now, trying to fix or reverse engineer a very legacy system: you will have to recompile a journalctl from 40 years ago before being able to read anything.

I've been producing a few services recently which output a chunk of JSON for each log message followed by a newline.

I think it actually solves most of the problems text logs have that binary don't (inability to easily present structured data, etc.) yet keeps the advantages of a text log (human readable, resistant to file corruption, future-proof).
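
A rough sketch of what such a service might do, not the poster's actual code. The helper name `log_event` and the field names (`ts`, `level`, `status`, `msg`) are made up for illustration:

```python
import json
import sys
from datetime import datetime, timezone


def log_event(stream, **fields):
    """Write one log record as a single line of compact JSON."""
    record = {"ts": datetime.now(timezone.utc).isoformat(), **fields}
    # json.dumps escapes newlines inside strings, so a record never spans lines.
    stream.write(json.dumps(record, separators=(",", ":")) + "\n")


# The embedded "\n" in msg is written as the two characters \ and n.
log_event(sys.stdout, level="error", status=502, msg="upstream timed out\nretrying")
```

Each record stays greppable as plain text, while a parser gets typed fields back with `json.loads`.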

Speaking for myself - multiline .json output is problematic, as most of the parsing tools work best when the data is on a single line, and it's a cognitive struggle to deal with multi-line output, even if you are clever with your tools. I usually end up having to write a json parser in python to get the data into a format that I can manipulate. (Thankfully, python does 95% of the work for you when reading a json file)

But - here is the thing, even though the .json format isn't convenient for me, I can, with about 20-30 minutes effort, write a parser that can get the data into a convenient format, because it started out as a text file.

If you're just grepping for a single word or phrase it really isn't much different to grepping regular logs.

If you're extracting structured data (e.g. getting the time stamp and a status code), it's actually easier than screwing around with awk and figuring out which exact column the time stamp finishes on and hoping that server #7 doesn't put it on a different column.

Like you said, a few minutes' work in python.

I wish more services did this.
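
For what it's worth, the "few minutes' work" for the time stamp and status code case could look roughly like this. The sample lines and the field names `ts` and `status` are assumptions, not any particular service's format:

```python
import json

# Hypothetical log lines; the field names "ts" and "status" are assumptions.
LINES = [
    '{"ts":"2015-02-01T10:00:00Z","status":200,"path":"/"}',
    "Traceback (most recent call last):",  # stray non-JSON output
    '{"ts":"2015-02-01T10:00:03Z","status":502,"path":"/api"}',
]


def extract(lines):
    """Yield (timestamp, status) pairs, skipping lines that are not JSON objects."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and "ts" in record and "status" in record:
            yield record["ts"], record["status"]


for ts, status in extract(LINES):
    print(ts, status)
```

Fields are looked up by name, so it does not matter which column server #7 puts them in.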

Well - to be clear, if I run into a log file with its data on a single line, 95% of the time it will take < 30 seconds to extract the data I need. If I run into a multi-line json file, trying to re-integrate all the data back into a single record will take me on the order of 30 minutes. (Mostly because I usually only do it once or twice a year, so I typically start from first principles each time. Multi-Line .json log files are very rare.)

95% of the time I just give up on the multi-line .json files - unless it's really, really critical, I probably don't want to spend 30 minutes writing code to re-assemble the data.

Text Log files, wherever possible, should capture their data on a single line. If they need to go multi-line, then having a transaction ID that is common among those lines, makes life easier.
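
To show why a shared transaction ID helps, here is a small sketch. The `txn=<id>` convention and the sample lines are invented for the example; real logs will mark the ID differently:

```python
import re
from collections import defaultdict

# Hypothetical convention: each line carries "txn=<id>" somewhere in it.
TXN = re.compile(r"\btxn=(\w+)")


def group_by_transaction(lines):
    """Collect lines that share a transaction ID, preserving their order."""
    groups = defaultdict(list)
    for line in lines:
        match = TXN.search(line)
        if match:
            groups[match.group(1)].append(line.rstrip("\n"))
    return dict(groups)


sample = [
    "10:00:01 txn=a1 request start\n",
    "10:00:01 txn=b2 request start\n",
    "10:00:02 txn=a1 db query took 40ms\n",
    "10:00:03 txn=a1 request end status=200\n",
]

for txn_id, lines in group_by_transaction(sample).items():
    print(txn_id, len(lines))
```

Without the ID, interleaved lines from concurrent requests cannot be reliably stitched back together.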

.json files (or xml files), are an interesting halfway point between pure text, and pure binary. They aren't easily parseable without tools, but, if you have to, you can always write your own tools to parse them.

Neither fish nor fowl.

Maybe I am misunderstanding, but it sounds like you are encountering bad json log file practices, because json entries are spanning multiple lines. That implies they are being printed in non-compact form, aka prettified. That's a problem in the pure text world too, and it hurts worse when it happens there. It's kind of an apples to oranges comparison.

Json log files should ideally print using compact form (which will never have raw newlines) so each entry only takes one line, which is then separated by a raw \n

If that practice is followed each line will represent the complete json object. So you can then pipe the file through jq, Perl, python etc one line at a time.

Printing prettified json to a log should be avoided because it then requires having to reconstitute individual events syntactically before being able to grep for an event. If pretty output is desired, pipe it through a prettifier.
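
If you are stuck with a prettified log anyway, the reconstitution step can be sketched with the standard library's `json.JSONDecoder.raw_decode`, which parses one value and reports where it stopped. The sample text is made up, and the sketch assumes the file holds nothing but JSON values separated by whitespace:

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from text that holds several back to back."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            return
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


pretty = '{\n  "status": 200,\n  "path": "/"\n}\n{\n  "status": 502,\n  "path": "/api"\n}\n'

for event in iter_json_objects(pretty):
    # Re-emit in compact form: one event per line, ready for grep.
    print(json.dumps(event, separators=(",", ":")))
```

Stray non-JSON output between events would raise `json.JSONDecodeError` here, which is one more reason to log compact lines in the first place.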

Config files are a different story: those should most definitely be pretty printed with one atom per line for nice diffability and the best readability and editability json can offer. Sadly, json for config files is a bad idea if you want humans to enjoy editing them by hand. In that case using yml is the best option I have encountered (ansible).

Not to disagree with you in any way, but `jq` is something you might look to add to your toolbox. With as much JSON as we see these days, it's a good tool to have.

Multiline text logs are terrible, but you can log JSON in a single line.

It's still verbose, but it compresses well.

Like you, I also deal with the kinda weird uwsgi logs. I feel like "universal format" probably didn't mean the format of all the lines in all the logs is the same - though your definition is probably more accurate.

Despite that, I can be pretty sure when I walk in to a foreign system there will be nginx logs, just where I expect them, almost certainly in the format I'm used to. And even if the format differs, it's not much of a problem. Binary logs, big problem.

Sure, on a site that uses ElasticSearch for its logs I would have no idea where to look. I'd be more at ease with SQL, but first you need to locate the DB, figure out the schema, get the SQL dialect right.

That said, I'd be far more at ease writing a SQL query to extract analytics from logs than cooking up some regexes and doing complex stuff with awk.

And I find the --since/--until parameters to journalctl far easier than matching dates by regex. Or even the --boot parameter to restrict logs to a specific boot, which would probably be doable with awk but definitely not as trivial.
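
To make the comparison concrete, this is roughly what "treating dates as dates" costs you on a plain text log. It is not how journalctl works internally, just a sketch assuming every line starts with an ISO 8601 timestamp followed by a space; the sample lines are made up:

```python
from datetime import datetime


def between(lines, since, until):
    """Yield lines whose leading timestamp satisfies since <= ts < until."""
    for line in lines:
        stamp, _, _rest = line.partition(" ")
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # no parseable timestamp, e.g. a continuation line
        if since <= ts < until:
            yield line


sample = [
    "2015-02-01T09:59:58 boot complete",
    "2015-02-01T10:00:03 worker 3 respawned",
    "    at some_frame (continuation without timestamp)",
    "2015-02-01T11:30:00 nightly job started",
]

hits = list(between(sample, datetime(2015, 2, 1, 10), datetime(2015, 2, 1, 11)))
print(hits)
```

It works, but only as long as the timestamp format assumption holds for every program writing to the file.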

I think that binary logs give you some compelling features, without taking away any: you can always just dump the logs on stdout and use grep as much as you want. :)

He's referring to the universal format being text itself.

"Text logs" is not a format at all, so it can't really be a universal format, either. But if there were such a thing as a "universal" format, it would probably by definition encompass everything in time and space. You think timestamps are a problem? Just wait until your logs get trapped in a quantum state. Talk about a heisenbug...

> But what we find totally unacceptable is log files being shoved into binary repositories as the primary storage location

The way I read his article, he's not really opposed to additionally keeping your logs around as text. But you make a good point of using text as the primary storage location, since you can always easily feed it to some binary system for further analysis.

Would the best practice then be to keep your logs around as (compressed) text, but additionally feed it to your log analysis system of choice for greater querying capabilities?

Exactly. And I think that's what every shop that has discovered Splunk (or other such tools) has started doing. Sysadmins love log data in a queryable format in a database. I'm the hugest advocate of this. I have some queries that took more than 30 minutes when run against modest text files, and that can be performed in under 50 msec when in a database.
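
A toy version of that workflow, using SQLite from the Python standard library as a stand-in for Splunk or any other store. The line layout (`<timestamp> <status> <path>`), table name and column names are all assumptions for the example. Note that the original line is kept in its own column:

```python
import sqlite3

# Hypothetical access-log lines: "<timestamp> <status> <path>".
raw_lines = [
    "2015-02-01T10:00:00 200 /",
    "2015-02-01T10:00:03 502 /api",
    "2015-02-01T10:00:04 502 /api",
    "2015-02-01T10:00:05 500 /login",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (ts TEXT, status INTEGER, path TEXT, raw TEXT)")
conn.execute("CREATE INDEX idx_status ON log (status)")

for line in raw_lines:
    ts, status, path = line.split(" ", 2)
    # Keep the untouched line next to the parsed fields.
    conn.execute("INSERT INTO log VALUES (?, ?, ?, ?)", (ts, int(status), path, line))

rows = conn.execute(
    "SELECT path, COUNT(*) FROM log WHERE status >= 500 GROUP BY path ORDER BY path"
).fetchall()
print(rows)
```

The text files stay the primary copy; the database is a derived index you can rebuild at any time.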

But don't cripple me by shoving your primary log files into binary format so I can't quickly pull data out of them with awk/grep/sed when I need to quickly diagnose a local issue.

Agreed. Logs are for when everything and anything is broken. They aren't supposed to be pretty or highly functional, they are just meant as a starting point for gathering data.

Why not just do both?

Our product stores all the logs raw in flat files on the file system. We don't use databases for keeping the logs, which allows you to scale massively (the ingestion limit is that of the correlation engine and disk bandwidth). You then just need an efficient search crawler and use of metadata so search performance is good too.

The issue is that if you ever need to pull the logs for court and you have messed with them (i.e. normalized them and stuffed them into a DB), then your chain of custody is broken.

Best of both worlds means parsed out normalisation so I don't have to remember that Juniper calls source ip srcIP and Cisco SourceIP, but the original logs under the covers for grepping if you need.
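
A minimal sketch of that idea, not our product's actual code. The `key=value` line layout, the canonical name `source_ip` and the helper `normalise` are assumptions for illustration; only the `srcIP` / `SourceIP` spellings come from the example above:

```python
# Vendor-specific field names mapped onto one canonical name (hypothetical mapping).
ALIASES = {"srcIP": "source_ip", "SourceIP": "source_ip"}


def normalise(raw_line):
    """Parse space-separated key=value pairs; keep the untouched line alongside."""
    fields = {}
    for token in raw_line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value
            fields[ALIASES.get(key, key)] = value
    return {"raw": raw_line, "fields": fields}


for line in ["srcIP=10.0.0.1 action=deny", "SourceIP=10.0.0.2 action=permit"]:
    event = normalise(line)
    print(event["fields"]["source_ip"], "|", event["raw"])
```

Queries go against the normalised fields, while the `raw` value is never modified and stays available for grepping or for evidence.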

> text logs are a universal format

Then a punch in the face is a universal form of communication. Also, EBCDIC is the only encoding the future will recognize!