by ElectricSpoon 884 days ago
> I would guess the developers wanted to prevent laptops running out of battery too quickly

And I would guess sysadmins also don't like their logging facilities filling the disks just because a service is stuck in a start loop. There are many reasons to think a service failing to start multiple times in a row won't start. Misconfiguration is probably the most frequent reason for that.


Exactly. If a service crashes within a second ten times in a row, it's not going to come up cleanly an eleventh time. The right thing to do is stay down, and let monitoring get the attention of a human operator who can figure out what the problem is. Continually rebooting is just going to fill up logs, spam other services, and generally make trouble.

I'm sure there are exceptions to this. For those, set Restart=always. But it's an absolutely terrible default.

It might actually, if a network connection is temporarily down.
Or a disk not attached yet. Or another service it depends on being slow to finish starting up.
So, you two know how systemd gets heat for doing too much, right?

This is one of those things.

The 'After=' and 'Requires=' directives address this.

Depends on a mount? Point those directives at a '.mount' unit.

Depends on networking, perhaps a specific NIC? Point those directives at 'systemd-networkd-wait-online@$REQUIRED_NIC.service'

Point being: declare these things, don't wait for entropy to eventually become stable.
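
To make the directives above concrete, here is a minimal sketch of a unit that declares both kinds of dependency. The unit name, binary path, mount point (/srv/data) and NIC (eth0) are all made up for illustration, and the wait-online instance only helps if systemd-networkd is what manages that interface.

```ini
# myservice.service (hypothetical unit; adjust names and paths to your system)
[Unit]
Description=Example service that needs a mount and a configured NIC

# /srv/data is assumed to be a separate mount; its unit name is the
# escaped path, so /srv/data becomes srv-data.mount.
Requires=srv-data.mount
After=srv-data.mount

# Assumes eth0 is managed by systemd-networkd.
Requires=systemd-networkd-wait-online@eth0.service
After=systemd-networkd-wait-online@eth0.service

[Service]
# Hypothetical binary and flag, for illustration only.
ExecStart=/usr/local/bin/myservice --data-dir /srv/data

[Install]
WantedBy=multi-user.target
```

Requires= pulls the other unit in and After= orders this one behind it; you generally want both, since neither implies the other.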

After and Requires are only when starting the service though. If a service (stupidly) crashes when a network connection is temporarily down (someone tripped over the router's power cord?), it needs to restart until the network connection is back up.
Sure, but now we're kind of back where we started: 'Restart='

With the requirements properly laid out, we've avoided restarting in a loop and gained a bit of robustness.

There's also 'PartOf=' which can help make the relationship bidirectional

I get your point, but these features are the bare minimum any boot system should have. If someone calls that “bloat”, they should go back and hit rocks together.
Agreed. Relationships in 'init' are fundamental.

Back on point though: don't expect the 11th restart to work when the last 10 didn't.

Contrived examples are contrived; it's a solved problem. Declare your dependencies.

Interestingly, the kubernetes approach is the opposite one. Dependencies between pods / software components are encouraged to be a little softer, so that the scheduler is simpler.

Starting up, noticing that the environment doesn't have what you need yet and dying quickly appears to be The Kubernetes Way. A scheduler will eventually restart you and you'll have another go. Repeat until everything is up.

The kubelet operates the same way afair. On a node that hasn't joined a cluster yet, it sits in a fail/restart loop until it's provisioned.

Heh. We used syslog at one place, with it configured to push logs into ELK. The ingestion into ELK broke … which caused syslog to start logging that it couldn't forward logs. Now that might seem like screaming into a void, but that log went to local disk, and syslog retried it as fast as disk would otherwise allow, so instantly every machine in the fleet started filling up its disks with logs.

(You can guess how we noticed the problem…)

Also logrotate. (And bounded on size.)

It's wild how easy it is to misconfigure (or fail to configure) logrotate and have a log file fill up the disk. Out of memory and/or out of disk are the two error cases that have led to the most pain in my career. I think most people who started with docker in the early days (long before there was a docker system prune) had this happen, where old docker containers/images filled up the disk and wreaked havoc at an unsuspecting point.
I used to joke that if VMware engineers couldn't figure out the logrotate configuration for their own product for a few releases, what chance do I have?
I've seen bad service designs that declare, e.g.:

Before=systemd-user-sessions.service

This means that as long as systemd is trying to (re)start the service, nobody can log in. Which is a problem with infinite restarts.

It's still pretty easy to accidentally set up an infinite restart loop with the default settings if your service takes more than 2s to crash.
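
The arithmetic behind that 2s figure, as far as I understand it: the stock defaults are StartLimitIntervalSec=10s, StartLimitBurst=5 and RestartSec=100ms. If each start-then-crash cycle takes longer than 10s / 5 = 2s, the sixth start lands outside the 10s window, the counter starts over, and the limit never trips. A minimal sketch of one way to close that gap is to widen the window; the unit name, binary path and numbers below are illustrative assumptions, not recommendations.

```ini
# myservice.service (hypothetical unit; values are illustrative)
[Unit]
Description=Example service with a wider start-rate limit
# Give up after 5 starts within 5 minutes, instead of the
# default 5 starts within 10 seconds.
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
# Hypothetical binary, for illustration only.
ExecStart=/usr/local/bin/myservice
Restart=on-failure
# Pause between attempts so a crash loop is less noisy in the logs.
RestartSec=5
```

With these numbers, a service that crashes 10s after each start is stopped on its sixth attempt, roughly 75s in, rather than looping indefinitely.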