I have a fair amount of experience with Linux, but I'm a little lost by what's going on here. Can someone out there dumb this down a bit for an init newbie?
You can pass the kernel options before booting. i.e. "quiet" or "debug". The Kernel exposes these options via /proc/cmdline.
systemd parses /proc/cmdline and if "debug" is set, it also raises it's log level to debug.
Then systemd logs so much to the kernel log that it gets flooded and fails to boot.
Edit: And story wise, systemd says yeah, we're not going to fix this or rename the debug flag for systemd to "systemd.debug" because blah.
So basically, systemd is catching the common keyword 'debug' from a command that being given to something else entirely, and in turn enabling its own logging, which floods the system. The command wasn't actually meant for systemd, but for the kernel. Is that correct?
You're correct, but from a user perspective it's a totally valid thing to do. When your system (or pretty much any software) doesn't work, you want to pass some "debug" flag so that the maximum information is logged and you can parse that later. You don't want to pass one debug flag per component, you want it to be global (just like when you say "quiet", you want every component to be quiet).
Pretty much. Though Linus says he's okay with other people parsing these options, that's why they expose them. But of course you should do it in a sane way.
And systemd refuses to patch their code to namespace their flag and listen for "systemd.debug" instead of the generic debug flag that is primarily meant for the kernel.
The kernel has a debug option that can be specified at boot time to increase its logging verbosity. systemd looks at the kernel command line to do the same and starts writing its own verbose logging out to the kernel's message buffer. Writing out to the kernel message buffer is done for pragmatic reasons: generally if you're debugging init problems they're happening when the machine is first starting so you would like to see the error messages on the serial console/virtual console.
The bug that started this all came up when someone had a broken system that was causing systemd to spew out tons of log messages. This was overwhelming the kernel message system and causing things to time out/etc and resulted in an unbootable system. When this problem got opened upstream, the systemd maintainer said "not my problem."
There are a couple of ways forward here: systemd can limit its own output and say "my bad"[1], systemd can begin looking for its own debug flag in the kernel command line instead of using the kernel's own debug flag or the kernel can implement rate limiting on the message system[2] to keep userspace from shooting itself in the foot.
Torvalds really doesn't like the systemd maintainer's "not my problem" attitude so he's threatened to no longer accept pull requests from him for any code he writes. Torvalds wrote a really good post in the linked Google+ comments that describes his feelings on the whole thing:
"I don't mind people piggy-backing on some fairly obvious generic term like "debug" per se. I don't know if the old init scripts did that, but I do know they did it for "quiet", which is basically the reverse of "debug".
What I mind is people closing bugs and not admitting mistakes. If Kay had even said "sorry, the excessive output was a bug in systemd, it's already fixed in current -git", that would have been a valid reason to close the bug.
And for the people thinking this is a storm in a teacup: this is not the first time Kay has done this, which is why I personally get so frustrated.
Kay has done the exact same thing with major bugs that were not fixed anywhere else, and that caused machines to fail at boot time, and Kay happily pointed the finger elsewhere for months at a time and closed bugzilla entries.
Side explanatory note: and it's because of that known history of abusive behavior that I would prefer systemd now use "systemd.debug".
The old init scripts may or may not have parsed "debug", but we never had any reason to care. Now we do, and people are (I think) understandably upset that systemd not only screwed up, but then the people involved weren't even willing to say "sorry" about it but instead go "uh, it wasn't our bug, deal with it"."
> systemd can limit its own output and say "my bad"[This happened]
Did it? I didn't find any such commit. IMHO it's the correct fix. /dev/kmsg is privileged and systemd should rate-limit what it writes there after giving priority to more critical messages.
Okay, I think 6a0f1f6d5af7c7300d3db7a0ba2b068f8abd222b "sd-event: rework API to support CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM, too" removed the dual_timestamp_is_set assert. It looks like it was fixed by accident, and there's zero acknowledgement that flooding /dev/kmsg could be a problem on systemd's side.
In a nutshell, a Linux kernel developer wants to limit the number of debugging messages he is obliged to read and act on, and Linus Torvalds thinks that's counterproductive, and Linus generally says what he thinks.
systemd parses /proc/cmdline and if "debug" is set, it also raises it's log level to debug. Then systemd logs so much to the kernel log that it gets flooded and fails to boot.
Edit: And story wise, systemd says yeah, we're not going to fix this or rename the debug flag for systemd to "systemd.debug" because blah.
And Torvalds gets pissed.