by thaumaturgy 4055 days ago
Cool, so which standard binary log storage format should we all switch to?

Should I submit patches to jawstats so that it'll support google-log-format 1.0 beta, or the newer Amazon Cloud Storage 5 format? Or both? Or just go with the older Microsoft Log Storage Format? Or wait until Gruber releases Fireball Format? Has he decided yet whether to store dates as little-endian Unix 64 bit int timestamps, or is he still thinking about going with the Visual FoxPro date format, y'know, where the first 4 bytes are a 32-bit little-endian integer representation of the Julian date (so Oct. 15, 1582 = 2299161) and the last 4 bytes are the little-endian integer time of day represented as milliseconds since midnight? (True story, I had to figure that one out once. Without documentation.)
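For anyone who runs into the same date format, here is a minimal Python sketch of decoding it. It assumes the layout is exactly as described above: two little-endian 32-bit integers, the Julian day number first and the milliseconds since midnight second. The function name and the offset constant are illustrative, not taken from any FoxPro documentation.

```python
import struct
from datetime import date, datetime, timedelta

# Difference between a Julian day number and Python's proleptic Gregorian
# ordinal (date.toordinal()). Check: Oct. 15, 1582 has JDN 2299161.
JDN_MINUS_ORDINAL = 1721425


def decode_foxpro_datetime(raw: bytes) -> datetime:
    """Decode 8 bytes: int32 Julian day number, then int32 ms since midnight."""
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    julian_day, millis = struct.unpack("<ii", raw)  # both little-endian
    day = date.fromordinal(julian_day - JDN_MINUS_ORDINAL)
    midnight = datetime(day.year, day.month, day.day)
    return midnight + timedelta(milliseconds=millis)
```

Feeding it the bytes for Julian day 2299161 with zero milliseconds should give midnight on Oct. 15, 1582, matching the example in the comment.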

Should I write a new plugin for Sublime Text to handle the binary log formats? Or write something that will read the binary storage format and spit out text? Or is that too inefficient? Or should I give up on reading logs in a text form at all and write a GUI for it (maybe in Visual Basic)?

Do you know when I should expect suexec to start writing the same binary log format as Apache, or should I give up waiting on that and just write a daemon to read the suexec binary logs and translate them to the Apache binary logs?

Should I take the time to write a natural language parsing search engine for my custom binary log format? Do you think that's worth the time investment? I would really like to be able to search for common misspellings when users ask about a missing email, you know, like "/[^\s]+@domain.com/" does now.
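For reference, a rough Python sketch of what that regex search does on a plain text log today. The sample log lines and the function name are made up for illustration, and the dot is escaped here, which the pattern quoted above does not do.

```python
import re

# Any run of non-whitespace followed by the domain, so misspelled
# local parts are caught as well as correct ones.
ADDRESS_AT_DOMAIN = re.compile(r"[^\s]+@domain\.com")


def addresses_in(lines):
    """Return every address at the domain found in the given log lines."""
    found = []
    for line in lines:
        found.extend(ADDRESS_AT_DOMAIN.findall(line))
    return found


# Hypothetical mail log lines, not real log output.
sample = [
    "rcpt jsmith@domain.com sent",
    "rcpt jsmiht@domain.com bounced",
    "connection closed",
]
matches = addresses_in(sample)
```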

I look forward to your guidance. I've been eagerly awaiting the day that I can have an urgent situation on my hands and I can dig through server logs with all of the ease and convenience of the Windows system logs.

1 comments

The system should provide a standard API for writing and reading logs. The precise format of the underlying log files is thus rather unimportant at this level of abstraction. Other than the logging subsystem and recovery tools, there's no need for any software to be accessing such log files directly (outside of the API functions). This is how Windows has done it for years.
Even if you can manage one per OS, it's not good enough. Have you ever worked in a non-monoculture or dealt with recovery of a system severely damaged by malice or accident?

I doubt my Linux (including webOS & Android), FreeBSD, and OS X boxes are going to settle on a single binary format in the next couple of decades or even a single API & toolset. In your brave new world the very first thing I'm going to need to do if I have to combine logs across them is to extract data from at least three formats and the most convenient format is often going to be text - i.e. right back where we started, but with extra work for each OS. More likely you'll get a mix of things using the system APIs, custom binary formats, custom text formats, and syslog. Adding more steps to get at the same data doesn't help.

More importantly, binary logs are unreliable when you're dealing with a system that's completely trashed. You can often get usable text logs off a disk that's throwing I/O errors every few dozen bytes or even from a corrupted raw disk image. They may not be cryptographically "sealed", but I'd rather have them than an error message about the binary format being corrupt. That should be an implementation detail, but I haven't seen much interest from the binary logs camp in making the file formats resilient.
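To make the recovery point concrete, here is a rough Python sketch of pulling readable lines out of a damaged blob, in the spirit of what `strings` does. The function name and the minimum-length threshold are arbitrary choices for illustration, and it only handles ASCII text.

```python
import re


def salvage_text(blob: bytes, min_len: int = 16):
    """Return runs of printable ASCII at least min_len bytes long."""
    # Printable ASCII plus tab; anything else is treated as damage
    # and simply splits the surrounding text into separate runs.
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(blob)]
```

Nothing here needs to know where records begin or end, which is why it still works when the bytes around a log line are garbage. A binary format would need framing that can resynchronise after damage to get the same property.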

You missed the joke at the end where he correctly pointed out that Windows' logging is a total joke, and that discovering information from Windows logs is essentially impossible unless the tool writer specifically predicted your use case.

And that's the nub of it: text logs are for when you may have many varied, complex reader use-cases, and you don't understand all those cases well enough yet to lock them down forever, and you have a thousand excellent tools at your disposal that you would like to be able to continue to use.

Recent log spelunking for me included:

    cat log.? | grep fail | sed 's/^.worker_id$//g' | awk '{ print $5, $4 }' | sort -n -r | sed 30q

There's no analogue in any binary logging system I've ever found.

It seems to me that a simple transitional tool for a binary logging system would be for the implementer of the binary logging system to also include a tool that consumed a binary log file on stdin and produced a stream on stdout in one (or more, selecting which by command line arguments) common text log formats.

That lets you develop an ecosystem of supporting tools that take advantage of any strengths of the binary format, while still allowing the freedom of using the (initially, at least, probably far more capable) set of tools available for the text formats.
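A minimal sketch of such a converter in Python, to show how little it takes. The binary record layout is entirely hypothetical (a little-endian uint32 Unix timestamp, a uint16 message length, then that many bytes of UTF-8); a real tool would read whatever the logging system actually writes.

```python
import io
import struct
from datetime import datetime, timezone

# HYPOTHETICAL record header: uint32 Unix time, uint16 message length.
HEADER = struct.Struct("<IH")


def binary_to_text(src, dst) -> int:
    """Read records from binary stream src, write one text line each to dst."""
    count = 0
    while True:
        header = src.read(HEADER.size)
        if len(header) < HEADER.size:
            break  # end of input, or a truncated header
        timestamp, length = HEADER.unpack(header)
        payload = src.read(length)
        if len(payload) < length:
            break  # truncated record, stop rather than guess
        stamp = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        message = payload.decode("utf-8", errors="replace")
        dst.write(f"{stamp.strftime('%Y-%m-%dT%H:%M:%SZ')} {message}\n")
        count += 1
    return count
```

As a command-line filter it would be called as `binary_to_text(sys.stdin.buffer, sys.stdout)`, after which grep, awk and friends work on the output as usual.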

What is the point of such a 'transition' if there never arrives any point at which there is net added value to a binary format?
If the format has some benefit (not necessarily a net benefit for every user at first, since benefit varies from user to user, but a significant one for some subset of users), then the point is to lower the cost of moving away from a native text format. That increases the number of users for whom the binary format is a net benefit from the start, which in turn increases early adoption and the effort likely to go into auxiliary tools that exploit the format, and so speeds up the rate at which it becomes a net benefit for a wider range of users.

This may or may not ever make it a net benefit for every user, but that's okay. There's a whole lot of space between "this technology is the best choice for everyone" and "this technology is the best choice for no one".

This isn't really true, as the Windows event logs contain text as well as the other structured data, and you can search that text using tools on the system. For example, to search for some specific text in the system log using Powershell:

    Get-EventLog -LogName System | Where {$_.Message -Match "something"}
To process text as fields, as with awk, one would use the Split method (at least to start off with):

    Get-EventLog -Log System | Where {$_.Message -Match "something"} | %{ $_.Message.Split()[5,4] }
But as message text is often parameterised, it may be easier to take advantage of this data to get what you need. For instance, this command would extract the latest machine sleep and wake times from the system log, and calculate the duration:

    Get-EventLog -Log System -Source Microsoft-Windows-Power-Troubleshooter -InstanceID 1 | Select-Object @{n="SleepTime";e={$_.ReplacementStrings[0]}}, @{n="WakeTime";e={$_.ReplacementStrings[1]}}, @{n="SleepDuration";e={([DateTime]$_.ReplacementStrings[1])-([DateTime]$_.ReplacementStrings[0])}}
One can also sort and get unique values, just as in Unix-type systems - this command lists all drives defragmented in the past 30 days:

    Get-EventLog -Log Application -Source Microsoft-Windows-Defrag -InstanceID 258 -After (Get-Date).AddDays(-30) | Select @{n="Drive";e={$_.ReplacementStrings[1]}} | Sort Drive | Unique -AsString
So all the same capabilities are there, and then some. You just need to know your tools well enough to take advantage of it.
Most binary formats contain text; that isn't what distinguishes them from text formats.

One of the objections, though, is that with binary formats you're limited to the capabilities of the tools that have been built to handle that particular format, which you're illustrating nicely. In a binary format world, I would have to know the capabilities and limitations of dozens, maybe hundreds of different tools for extracting useful information from logs, instead of the small handful of tools I use to do the same job now, which can be applied to any log file formatted as plain text.

And that's assuming that all these other tools will be as powerful as Powershell, which isn't a bet I'd want to make.

madhouse has some fair points about the limitations of text logs, but "everything should be stored in binary formats" is not a great idea. Actually, "a terrifying new hell" is probably closer to how I feel about it.

If you want to stick to a text-only workflow rather than take advantage of the structured data features, you only need a tool that converts the binary log format to your preferred text format, which isn't too arduous. In systemd that would be journalctl; in Windows, anything that can use the event log API, such as Powershell or many other utilities.
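As an illustration on the systemd side, here is a small Python sketch that turns one line of `journalctl -o json` output into a syslog-style text line. The output layout is my own choice, and the sketch assumes MESSAGE is a plain string, which is not guaranteed for entries containing non-UTF-8 data.

```python
import json
from datetime import datetime, timezone


def journal_json_to_text(line: str) -> str:
    """Format one journal entry (a JSON object on one line) as a text line."""
    entry = json.loads(line)
    # __REALTIME_TIMESTAMP is microseconds since the Unix epoch, as a string.
    micros = int(entry["__REALTIME_TIMESTAMP"])
    stamp = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
    host = entry.get("_HOSTNAME", "-")
    ident = entry.get("SYSLOG_IDENTIFIER", "-")
    message = entry.get("MESSAGE", "")
    return f"{stamp.strftime('%Y-%m-%dT%H:%M:%SZ')} {host} {ident}: {message}"
```

In practice journalctl's own text output modes already cover the common case; a script like this only matters if you want a custom layout.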

The examples I posted above were just to show the equivalent capabilities in Powershell but really it's all flexible enough to use whatever you like.