by ghshephard 4055 days ago
grep definitely breaks down on large systems. I have one environment with approx 5 million nodes (on the order of 1e6), and the only way to coherently manage the log updates from them is in binary format.

But even so, I like to keep the text files as journals of original entry, so I can occasionally do a `tail -f incoming.log | egrep -i "somedevice"`.
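
For anyone who wants the same follow-and-filter behaviour inside a script rather than a shell pipeline, here is a minimal Python sketch. The function names `follow` and `grep_i` are made up for illustration, and the polling loop is a simplification: it starts at the end of the file and does not handle log rotation the way `tail -F` does.

```python
import os
import re
import time


def follow(path, poll_seconds=0.5):
    """Yield lines appended to `path`, roughly like `tail -f` (hypothetical helper)."""
    with open(path, "r", errors="replace") as handle:
        handle.seek(0, os.SEEK_END)  # start at the current end of the file
        while True:
            line = handle.readline()
            if line:
                yield line
            else:
                time.sleep(poll_seconds)  # nothing new yet; wait and retry


def grep_i(lines, pattern):
    """Yield lines matching `pattern` case-insensitively, like `egrep -i`."""
    regex = re.compile(pattern, re.IGNORECASE)
    for line in lines:
        if regex.search(line):
            yield line
```

Something like `for line in grep_i(follow("incoming.log"), "somedevice"): print(line, end="")` would then approximate the one-liner above; it runs until interrupted.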

And having the original files in text format is zero impediment to getting them into handy binary database form.
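
To make that concrete, here is a rough sketch of loading text log lines into a binary database, using SQLite from the Python standard library. The `timestamp device message` line layout, the table name, and the sample lines are all assumptions for illustration; the real log format is not described in this thread.

```python
import sqlite3


def load_log_lines(lines, conn):
    """Parse assumed 'timestamp device message' lines and insert them into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS log (ts TEXT, device TEXT, message TEXT)"
    )
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        rows.append(tuple(parts))
    conn.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Hypothetical sample lines; an in-memory database keeps the sketch self-contained.
sample = [
    "2015-02-17T10:00:00Z somedevice link up\n",
    "2015-02-17T10:00:05Z otherdevice link down\n",
    "malformed\n",
]
conn = sqlite3.connect(":memory:")
inserted = load_log_lines(sample, conn)
```

The text file stays untouched as the journal of original entry, and the database can be rebuilt from it at any time.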


I hate arguing semantics, but 1e6 is not just large but very large indeed. (:

That said, I'd be curious to know some more of the details of that system actually! If you're aggregating all of those devices together, using something binary in that context definitely makes sense. In fact, if I were in your shoes and tasked with designing some means of solving that problem, I would probably use something like protobuf or capnp to emit those messages since they're well-known and well-understood serialization mechanisms.

Now, that's the integration and aggregation side of this exercise.

On a local node-by-node basis, though, I absolutely agree; having the raw text as journals of original entry for inspection in real time with `tail -f` (or, if you're using multilog, `tail -F`…) would still be incredibly useful.

Going back to Mr. Nagy's article, the space of problems that `tail -f` solves barely overlaps the space of problems solved by aggregation. I think he's conflated the two spaces in his article here (and especially in the previous one), applying a one-size-fits-all solution to both where it demonstrably does not fit.

The remote nodes all log to central DNS servers and trap servers. The DNS servers have a nice update.log file that provides their IP address information, plus some nice text configs. The trap data goes into a binary file (a database, actually) and requires analysis through a web interface.

As a result, I use the DNS updates approximately 20x more often than the trap data when doing diagnostics, even though, in theory, the trap data is far richer and, of course, has the 15 mandatory fields that are functions of the binary logging (Time, Date, Event ID, Trap Type, etc.).

Memories of supporting subscriber CPEs and having to go through Drum to analyze the data coming out of logged SNMP traps/notifications are flashing back. Thanks for that. (:

But, yeah, assuming that the nodes in discussion here are not amd64 machines but are instead subscriber CPEs, that's a totally workable (and, frankly, agreeable) solution.