Hacker News
by nrr 4055 days ago
Oddly enough, even for large (>=1e5 physical machines) systems, grep works fine. Better yet, if the logs are important, you're shunting them off for some sort of longer-term storage for post-processing and indexing _anyway_, irrespective of the underlying disk format. Some folks continue to use plain text even then, just with some distributed systems magic wrapped around the traditional Unix tools.

(If you're shunting _all_ of your log data off at that scale, you're crazy, and you'll melt your switches if you aren't careful.)

The name of the game is to think of the problems that you're solving and how they relate to the business bottom line. No sooner, no later. Additionally, what's most troubling is that we've turned this exercise into an emotional one rather than one grounded in any sort of scientific perspective.

I'd genuinely like to sit down and collect data on, e.g., how many instructions it takes to store logs to disk in plain text versus a binary format, how many it takes to retrieve them in each case, and how much search latency I incur when retrieving them. At scale, which is where most of my attention lies these days, that's the kind of thing that matters because those effects get multiplied by the number of machines you have, often to the horror of operators and capacity planners.
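A rough way to start collecting that kind of data on a single machine is sketched below in Python. Everything in it is an assumption made for illustration: the record layout (timestamp, severity, message), the fixed `struct` header standing in for "a binary format", and all of the helper names. It measures wall-clock time and byte counts rather than instructions, so the numbers are only useful relative to each other.

```python
import struct
import time

# Hypothetical record layout: (unix timestamp, severity, message).
# Header: float64 timestamp, uint8 severity, uint16 message length (little-endian, unpadded).
HEADER = struct.Struct("<dBH")


def encode_text(records):
    """Serialize records as newline-terminated, space-separated text."""
    return "".join(f"{ts:.6f} {sev} {msg}\n" for ts, sev, msg in records).encode("utf-8")


def encode_binary(records):
    """Serialize each record as a fixed header followed by the UTF-8 message bytes."""
    out = bytearray()
    for ts, sev, msg in records:
        body = msg.encode("utf-8")
        out += HEADER.pack(ts, sev, len(body)) + body
    return bytes(out)


def decode_text(blob):
    """Parse the text form back into (timestamp, severity, message) tuples."""
    records = []
    for line in blob.decode("utf-8").splitlines():
        ts, sev, msg = line.split(" ", 2)
        records.append((float(ts), int(sev), msg))
    return records


def decode_binary(blob):
    """Walk the binary form header by header and rebuild the tuples."""
    records, offset = [], 0
    while offset < len(blob):
        ts, sev, length = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        records.append((ts, sev, blob[offset:offset + length].decode("utf-8")))
        offset += length
    return records


def measure(fn, arg):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(arg)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Made-up sample data, purely to have something to time.
    sample = [(1700000000.0 + i, i % 8, f"device{i % 100} link state changed")
              for i in range(100_000)]
    for name, enc, dec in (("text", encode_text, decode_text),
                           ("binary", encode_binary, decode_binary)):
        blob, t_enc = measure(enc, sample)
        _, t_dec = measure(dec, blob)
        print(f"{name}: {len(blob)} bytes, encode {t_enc:.3f}s, decode {t_dec:.3f}s")
```

A single run like this says nothing about disk, page cache, or network effects; it only gives a starting point for the comparison.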

If you're dealing with smaller systems, it won't matter as much, but at that point, you're probably dealing with the other side of this, which is having information on how many requests you get for historical log data and what sort of criteria were used in that search. If you're getting requests less frequently than, say, once per quarter, it likely wouldn't be worth your time to invest in what Mr. Nagy is evangelizing.

tl;dr: Continue using your ad hoc grep-fu, but be mindful of how much time it takes you to get the data you're looking for. That alone will be your decision criterion for adopting something like this.
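One low-effort way to stay mindful of that time is to record it whenever you search. The sketch below is a Python stand-in for a case-insensitive grep, not a replacement for it; `timed_search` is a hypothetical helper name.

```python
import re
import time


def timed_search(lines, pattern, ignore_case=True):
    """Return (matching lines, elapsed seconds) for a regex scan over an iterable of lines."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    start = time.perf_counter()
    matches = [line for line in lines if regex.search(line)]
    return matches, time.perf_counter() - start
```

Passing an open file object as `lines` scans it line by line. Logging the elapsed value over a few months gives you the number the decision actually hinges on.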


grep definitely breaks down on large systems. I have one environment with approximately 5 million nodes (on the order of 1e6), and the only way to coherently manage the log updates from them is in binary format.

But even still, I like to have the text files as journals of original entry, so I can occasionally do a `tail -f incoming.log | egrep -i "somedevice"`.

And having the original files in text format is zero impediment to getting them into handy binary database form.
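As a minimal sketch of that last point, here is one way text lines could be loaded into a queryable database using Python's standard `sqlite3` module. The line layout (`timestamp device message`), the table name, and the sample lines are all assumptions for illustration; the actual format of the files described here isn't given.

```python
import sqlite3


def load_text_log(lines, conn):
    """Insert 'timestamp device message' lines into an SQLite table; return rows loaded."""
    conn.execute("CREATE TABLE IF NOT EXISTS log (ts TEXT, device TEXT, message TEXT)")
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) == 3:  # skip lines that do not match the assumed layout
            rows.append(tuple(parts))
    conn.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Made-up sample lines in the assumed layout.
    sample = [
        "2015-02-20T10:00:00Z cpe-0001 link up\n",
        "2015-02-20T10:00:05Z cpe-0002 link down\n",
    ]
    print(load_text_log(sample, conn))
    print(conn.execute("SELECT device FROM log WHERE message LIKE '%down%'").fetchall())
```

The text file stays untouched as the journal of original entry; the database is just a derived view that can be rebuilt at any time.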

I hate arguing semantics, but 1e6 is not just large but very large indeed. (:

That said, I'd be curious to know some more of the details of that system actually! If you're aggregating all of those devices together, using something binary in that context definitely makes sense. In fact, if I were in your shoes and tasked with designing some means of solving that problem, I would probably use something like protobuf or capnp to emit those messages since they're well-known and well-understood serialization mechanisms.

Now, that's the integration and aggregation side of this exercise.

On a local node-by-node basis, though, I absolutely agree; having the raw text as journals of original entry for inspection in real time with `tail -f` (or, if you're using multilog, `tail -F`…) would still be incredibly useful.

Going back to Mr. Nagy's article, the space of problems that `tail -f` solves is barely overlapped by the space of problems solved by aggregation. I think he's conflated the two spaces in his article here (and especially in the one previous) whereby he's applied a one-size-fits-all solution to both where it demonstrably does not fit all.

The remote nodes all log to central DNS servers and trap servers. The DNS servers have a nice update.log file that provides their IP address information, and some nice text configs. The trap data goes into a binary file (a database, actually) and requires analysis through a web interface.

As a result, when doing diagnostics I use the DNS updates approximately 20x more often than the trap data, even though, in theory, the trap data is far richer and, of course, has the 15 mandatory fields that are functions of the binary logging (Time, Date, Event ID, Trap Type, etc.).

Memories of supporting subscriber CPEs and having to go through Drum to analyze the data coming out of logged SNMP traps/notifications are flashing back. Thanks for that. (:

But, yeah, assuming that the nodes in discussion here are not amd64 machines but are instead subscriber CPEs, that's a totally workable (and, frankly, agreeable) solution.