Hacker News
by benjaminjackman 4055 days ago
Maybe I am misunderstanding, but it sounds like you are running into bad JSON logging practices: the JSON entries span multiple lines, which implies they are being printed in non-compact (prettified) form. That's a problem in the pure-text world too, and it hurts worse when it happens there. It's kind of an apples-to-oranges comparison.

JSON log files should ideally be printed in compact form (which never contains raw newlines) so that each entry takes exactly one line, with entries separated by a raw \n.
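
As a minimal sketch of the compact form described above, assuming Python's standard `json` module (the `format_entry` helper and the sample event are hypothetical, for illustration only):

```python
import json

def format_entry(event: dict) -> str:
    """Serialize one event as compact, single-line JSON."""
    # Without an indent argument, json.dumps emits no raw newlines;
    # newlines inside string values are escaped as the two characters \n.
    return json.dumps(event, separators=(",", ":"))

entry = format_entry({"level": "error", "msg": "line one\nline two"})
print(entry)  # one physical line, safe to terminate with a raw newline
```

The embedded newline survives a round trip through `json.loads`, so nothing is lost by keeping each entry on one line.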

If that practice is followed, each line represents one complete JSON object, so you can pipe the file through jq, Perl, Python, etc. one line at a time.
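
A rough sketch of that line-at-a-time processing in Python, assuming one compact JSON object per line (the `iter_events` name and the `level`/`msg` fields are made up for the example):

```python
import json
from typing import Iterable, Iterator

def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON object per line, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.loads(line)

# Any iterable of lines works here, including an open file or sys.stdin.
sample = [
    '{"level":"info","msg":"started"}\n',
    '\n',
    '{"level":"error","msg":"disk full"}\n',
]
errors = [e for e in iter_events(sample) if e["level"] == "error"]
print(errors)
```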

Printing prettified JSON to a log should be avoided because individual events then have to be reconstituted syntactically before you can grep for an event. If pretty output is desired, pipe it through a prettifier.
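
A prettifier can be tiny. This is only a sketch, assuming compact one-line entries as input; `prettify` is a hypothetical helper, not part of any logging tool:

```python
import json

def prettify(line: str) -> str:
    """Re-serialize one compact JSON log line with indentation for reading."""
    return json.dumps(json.loads(line), indent=2)

pretty = prettify('{"level":"error","msg":"disk full"}')
print(pretty)
```

The log itself stays compact and greppable; the indentation is applied only at viewing time.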

Config files are a different story: those should most definitely be pretty-printed with one atom per line, for nice diffability and the best readability and editability JSON can offer. Sadly, JSON is a bad choice for config files if you want humans to enjoy editing them by hand. In that case YAML is the best option I have encountered (Ansible).


I have no problem with json output in log files, but I would greatly prefer it be constrained to the message portion of a logline. At a minimum I generally want three things per line, a timestamp (in ISO 8601 or something close), a message type (info, warning, error, etc) or log entry source, and the message itself. I don't want to be looking into the JSON structure itself for a timestamp, especially when the field encoding the timestamp may be called something slightly different based on what generated the log...

In that respect, whether the message is JSON, or YAML, or XML doesn't matter, that can easily be worked on later, but the first thing I want to be able to do is filter by time and type.

>I don't want to be looking into the JSON structure itself for a timestamp

A) JSON parsers are relatively common and reliable.

B) The timestamp would be human readable even without the parser.

>especially when the field encoding the timestamp may be called something slightly different based on what generated the log...

I often come across logs that put timestamps in different places on the line and encode them differently (or don't output a timestamp at all, sometimes). This is no different to having to deal with a differently named JSON property.

My point is really around having the date be in a well defined place that isn't necessarily defined by the application that's logging. If the log entry date is at the beginning of the line, there's no ambiguity as to whether it's the log entry date or some other date being logged, and it also doesn't require parsing the JSON at all to filter by the date. If it's not at some very standard location that's easy to filter by (a possibly changing JSON property does not qualify), then it's hard to know you are filtering on the right data, and it may also require a transform before filtering. JSON parsers are fast, but multi-GB log files will still cause some extra overhead and slow the operation down, so it's best to reduce the working set before parsing the JSON.
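
To illustrate reducing the working set before parsing, here is a sketch that assumes every line starts with a fixed-width `YYYY-MM-DD HH:MM:SS` timestamp followed by a type and a JSON message; the line layout, sample data, and function names are assumptions for the example:

```python
import json

LINES = [
    '2015-01-01 01:01:01 Info {"msg":"A"}',
    '2015-12-13 01:34:55 Error {"msg":"C"}',
    '2015-12-13 12:34:55 Debug {"msg":"B"}',
]

def in_window(line: str, start: str, end: str) -> bool:
    """Compare the 19-character timestamp prefix as plain text."""
    # Fixed-width timestamps in this format sort lexicographically
    # in chronological order, so no date parsing is needed.
    return start <= line[:19] < end

def messages_between(lines, start, end):
    """Parse the JSON message only for lines inside [start, end)."""
    for line in lines:
        if in_window(line, start, end):
            # Skip the type field; the remainder is the JSON message.
            yield json.loads(line[19:].split(None, 1)[1])

selected = list(messages_between(LINES, "2015-12-13 00:00:00", "2015-12-14 00:00:00"))
print(selected)
```

Lines outside the window are rejected with a string comparison and never reach the JSON parser.
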
>My point is really around having the date be in a well defined place that isn't necessarily defined by the application that's logging. If the log entry date is at the beginning of the line, there's no ambiguity as to whether it's the log entry date or some other date being logged, and it also doesn't require parsing the JSON at all to filter by the date.

Take this example:

1-1-15 1:1:1 Info Log message A

12-13-15 12:34:55 Debug Log message B

12-13-15 1:34:55 Error log message C

12-13-15 1:34:55Error log message D

[12-13-15 1:34:55]Error log message E

It doesn't require parsing JSON to get the date, you're right about that. It's harder than parsing JSON, though.

Note my two prior replies, where I specify ISO 8601 or similar. Also note where I said the JSON would be constrained to the message portion of the entry. Preferably there's a logging mechanism that takes care of that for you, so you can't screw up the timestamp and type portions of the entry. In that case, your entries become:

  2015-01-01 01:01:01 Info Log message A
  2015-12-13 12:34:55 Debug Log message B
  2015-12-13 01:34:55 Error log message C
  2015-12-13 13:34:55 Error log message D # let's assume that was 1 PM data for the sake of the example
  2015-12-13 01:34:55 Error log message E
Getting the date is trivial. Getting the type is also trivial. Give a static field size to the type and it's even more so. The point is, you abstract the message from the rest of it, so the message can't screw up the metadata of the entry, and you can log whatever you want for the actual message (XML, JSON, plain text, whatever, just no raw newlines). This is what we have today with syslog, sans newline replacement and with a slightly different date format (but still unambiguous). It works. It's useful. It's VERY easy to filter by type or date. You can take the first X chars and split on space/whitespace if you need to. You can log a message of a few megabytes, and if there are no raw newlines there are efficient utilities to ignore that until you have what you want (/bin/cut).
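
A minimal sketch of that split, roughly what cut does with space-delimited fields. It assumes the date, time, and type are the first three whitespace-separated fields, and `split_entry` is a hypothetical helper:

```python
def split_entry(line: str):
    """Split a log line into (timestamp, type, message) without touching the message."""
    # maxsplit=3 stops after date, time, and type; the message is left
    # intact even if it contains spaces or megabytes of JSON.
    date, time, level, message = line.rstrip("\n").split(None, 3)
    return f"{date} {time}", level, message

stamp, level, message = split_entry(
    '2015-12-13 13:34:55 Error {"code": 500, "detail": "log message D"}\n'
)
print(stamp, level, message)
```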