Hacker News
by ghshephard 4055 days ago
Speaking for myself - multi-line .json output is problematic, as most parsing tools work best when the data is on a single line, and it's a cognitive struggle to deal with multi-line output, even if you are clever with your tools. I usually end up having to write a JSON parser in Python to get the data into a format that I can manipulate. (Thankfully, Python does 95% of the work for you when reading a JSON file.)

But - here is the thing: even though the .json format isn't convenient for me, I can, with about 20-30 minutes' effort, write a parser that gets the data into a convenient format, because it started out as a text file.

If you're just grepping for a single word or phrase it really isn't much different to grepping regular logs.

If you're extracting structured data (e.g. getting the time stamp and a status code), it's actually easier than screwing around with awk and figuring out which exact column the time stamp finishes on and hoping that server #7 doesn't put it on a different column.
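A minimal sketch of that kind of extraction in Python, assuming one JSON object per line. The field names (`ts`, `status`, `path`) and the sample lines are made up for illustration; real logs will name things differently.

```python
import json

# Hypothetical sample: one compact JSON object per line.
# The field names ("ts", "status", "path") are assumptions for illustration.
LOG_LINES = [
    '{"ts": "2014-03-01T12:00:00Z", "status": 200, "path": "/"}',
    '{"ts": "2014-03-01T12:00:01Z", "status": 500, "path": "/api"}',
    'not json at all',
]

def extract(lines):
    """Yield (timestamp, status) pairs, skipping lines that are not JSON objects."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON lines
        if isinstance(record, dict) and "ts" in record and "status" in record:
            yield record["ts"], record["status"]

pairs = list(extract(LOG_LINES))
```

The lookup is by key, so it keeps working if a server emits the fields in a different order, which is exactly where column counting tends to break.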

Like you said, a few minutes' work in python.

I wish more services did this.

Well - to be clear, if I run into a log file with its data on a single line, 95% of the time it will take < 30 seconds to extract the data I need. If I run into a multi-line JSON file, trying to re-integrate all the data back into a single record will take me on the order of 30 minutes. (Mostly because I usually only do it once or twice a year, so I typically start from first principles each time. Multi-line .json log files are very rare.)

95% of the time I just give up on the multi-line .json files - unless it's really, really critical, I probably don't want to spend 30 minutes writing code to re-assemble the data.
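For what it's worth, the re-assembly step can be fairly short with Python's standard library. This is only a sketch, under the assumptions that the file is a sequence of pretty-printed JSON values separated by whitespace and that it fits in memory; `iter_json_objects` and the sample text are hypothetical.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value found in text, in order."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Hypothetical multi-line log: two pretty-printed records back to back.
PRETTY = """{
  "id": 1,
  "msg": "start"
}
{
  "id": 2,
  "msg": "stop"
}
"""

# Re-emit every record in compact form, one per line.
compact = [json.dumps(obj, separators=(",", ":")) for obj in iter_json_objects(PRETTY)]
```

Once the records are back on one line each, the usual grep and friends apply again. A file with non-JSON text mixed in between records would need more handling than this.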

Text log files, wherever possible, should capture their data on a single line. If they need to go multi-line, then having a transaction ID that is common among those lines makes life easier.
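To illustrate the transaction ID idea, here is a small sketch that regroups interleaved lines. The `txn=<id>` prefix is an assumed format invented for this example, not a convention from any particular logger.

```python
from collections import defaultdict

# Hypothetical interleaved log lines, each starting with "txn=<id> ".
LINES = [
    "txn=a1 request received",
    "txn=b2 request received",
    "txn=a1 response sent status=200",
]

def group_by_txn(lines):
    """Collect the remainder of each line under its transaction ID."""
    groups = defaultdict(list)
    for line in lines:
        key, _, rest = line.partition(" ")
        if key.startswith("txn="):
            groups[key[len("txn="):]].append(rest)
    return dict(groups)

grouped = group_by_txn(LINES)
```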

.json files (or XML files) are an interesting halfway point between pure text and pure binary. They aren't easily parseable without tools but, if you have to, you can always write your own tools to parse them.

Neither fish nor fowl.

Maybe I am misunderstanding, but it sounds like you are encountering bad JSON log file practices: JSON entries are spanning multiple lines, which implies they are being printed in non-compact form, aka prettified. That's a problem in the pure text world too, and it hurts worse when it happens there. It's kind of an apples-to-oranges comparison.

JSON log files should ideally be printed in compact form (which will never contain raw newlines) so that each entry takes exactly one line, with entries separated by a raw \n.

If that practice is followed each line will represent the complete json object. So you can then pipe the file through jq, Perl, python etc one line at a time.
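A quick sketch of that practice in Python, using an in-memory buffer in place of a real log file. The event contents are invented; the point is that `json.dumps` escapes a newline inside a string as the two characters `\n`, so the encoded record stays on one line.

```python
import io
import json

# Hypothetical events; note the embedded newline in the first message.
events = [
    {"level": "info", "msg": "line one\nline two"},
    {"level": "error", "msg": "disk full"},
]

buf = io.StringIO()  # stands in for the log file
for event in events:
    # Compact separators, one record per line, terminated by a raw newline.
    buf.write(json.dumps(event, separators=(",", ":")) + "\n")

# Reading back: every line is a complete JSON object on its own.
lines = buf.getvalue().splitlines()
decoded = [json.loads(line) for line in lines]
```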

Printing prettified JSON to a log should be avoided, because you then have to reconstitute individual events syntactically before you can grep for an event. If pretty output is desired, pipe it through a prettifier.
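A prettifier for that purpose can be tiny. This is a sketch, assuming the input is one JSON value per line; `prettify` is a hypothetical helper name.

```python
import json

def prettify(lines, indent=2):
    """Yield an indented rendering of each non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.dumps(json.loads(line), indent=indent)

# Hypothetical compact log line.
sample = ['{"level":"error","tags":["disk","sda1"]}']
pretty = list(prettify(sample))
```

Passing `sys.stdin` as `lines` would turn this into a filter at the end of a pipeline, so the log on disk stays compact and only the view is pretty.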

Config files are a different story: those should most definitely be pretty-printed with one atom per line, for nice diffability and the best readability and editability JSON can offer. Sadly, JSON is a bad idea for config files if you want humans to enjoy editing them by hand. In that case YAML is the best option I have encountered (Ansible).
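For the JSON case, one way to get stable, diff-friendly output from Python is to indent and sort the keys. A sketch, with made-up config keys and a hypothetical `render_config` helper:

```python
import json

def render_config(cfg):
    """Render a config dict with one atom per line and keys in sorted order."""
    return json.dumps(cfg, indent=2, sort_keys=True) + "\n"

# Two hypothetical configs with the same content but different key order.
a = {"port": 8080, "hosts": ["a.example", "b.example"], "debug": False}
b = {"debug": False, "hosts": ["a.example", "b.example"], "port": 8080}

text_a = render_config(a)
text_b = render_config(b)
```

With sorted keys the two renderings are byte-identical, so a diff only shows real changes rather than reordering noise.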

I have no problem with JSON output in log files, but I would greatly prefer it be constrained to the message portion of a logline. At a minimum I generally want three things per line: a timestamp (in ISO 8601 or something close), a message type (info, warning, error, etc.) or log entry source, and the message itself. I don't want to be looking into the JSON structure itself for a timestamp, especially when the field encoding the timestamp may be called something slightly different based on what generated the log...

In that respect, whether the message is JSON, or YAML, or XML doesn't matter, that can easily be worked on later, but the first thing I want to be able to do is filter by time and type.
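A sketch of what that could look like, assuming a space-separated `timestamp level message` layout with a JSON message. The layout, the sample lines, and the `select` helper are all assumptions for illustration. The time and type filter runs on the plain-text prefix, and the JSON is only parsed for lines that survive it.

```python
import json

# Hypothetical loglines: ISO 8601 timestamp, level, then a JSON message.
LINES = [
    '2014-03-01T11:59:59Z INFO {"msg":"warming up"}',
    '2014-03-01T12:00:01Z ERROR {"msg":"disk full","dev":"sda1"}',
    '2014-03-01T12:00:02Z INFO {"msg":"retrying"}',
    '2014-03-01T12:05:00Z ERROR {"msg":"timeout"}',
]

def select(lines, start, end, level):
    """Yield (timestamp, payload) for lines in [start, end) with the given level."""
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) != 3:
            continue  # not in the expected three-part layout
        ts, lvl, payload = parts
        # Same-format UTC ISO 8601 timestamps sort lexicographically,
        # so plain string comparison is enough here.
        if lvl == level and start <= ts < end:
            yield ts, json.loads(payload)  # parse only the survivors

hits = list(select(LINES, "2014-03-01T12:00:00Z", "2014-03-01T12:05:00Z", "ERROR"))
```

The string comparison only holds if every line uses the same timestamp format and timezone; mixed formats would need real date parsing.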

>I don't want to be looking into the JSON structure itself for a timestamp

A) JSON parsers are relatively common and reliable.

B) The timestamp would be human readable even without the parser.

>especially when the field encoding the timestamp may be called something slightly different based on what generated the log...

I often come across logs that put timestamps in different places on the line and encode them differently (or don't output a timestamp at all, sometimes). This is no different to having to deal with a differently named JSON property.

My point is really around having the date be in a well-defined place that isn't necessarily defined by the application that's logging. If the log entry date is at the beginning of the line, there's no ambiguity as to whether it's the log entry date or some other date being logged, and it also doesn't require parsing the JSON at all to filter by the date. If it's not at some very standard location that's easy to filter by (a possibly changing JSON property does not qualify), then it's hard to know you are filtering on the right data, and it may also require a transform before filtering. JSON parsers are fast, but multi-GB log files will still cause some extra overhead and slow the operation down, so it's best to reduce the working set before parsing the JSON.

Not to disagree with you in any way, but `jq` is something you might look to add to your toolbox. As much JSON as we see anymore, it's a good tool to have.

Multiline text logs are terrible, but you can log JSON in a single line.

It's still verbose, but it compresses well.
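A rough way to check the compression claim on your own data, sketched with gzip from the Python standard library. The records here are synthetic and highly repetitive, so the ratio they produce says little about real logs.

```python
import gzip
import json

# Synthetic single-line JSON records; the field names are made up.
lines = [
    json.dumps({
        "ts": f"2014-03-01T12:00:{i % 60:02d}Z",
        "level": "info",
        "msg": "request served",
        "status": 200,
    })
    for i in range(1000)
]

raw = ("\n".join(lines) + "\n").encode("utf-8")
packed = gzip.compress(raw)

# Fraction of the original size that remains after compression.
ratio = len(packed) / len(raw)
```

The repeated key names are what make JSON logs verbose, and they are also the kind of redundancy a general-purpose compressor removes well.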