by acdha 894 days ago
> Services timing out on start-up, "waiting x/y -- x/y retries

What would you prefer happen instead? I’ve seen plenty of downtime caused by services that had no retries or that hung indefinitely, and neither was exactly an improvement in my experience.


Leave the service dead, or just kill -9 the process.

"service didn't start, oh well"

"service didn't stop, force killing it"

Is just fine.

It's my responsibility to ensure the service is operating correctly, not all-for-one systemd. Red Hat bloatware dictating how Linux should be.

I am aware that I can change the timeout of retries, but that's documentation overhead.

> Is just fine.

It’s prolonged downtime, that’s what it is, and it may make the system unusable. I saw that several times during the transition period where a network outage or data center shutdown meant that some important daemons failed on startup (one of them had the concept of retrying a connection error, but not a DNS resolution failure) and the admins of the SysV boxes had to manually restart everything while the Upstart & systemd boxes recovered almost immediately. One fun case had the developer almost getting this right: they had implemented retries but without a time delay or backoff, so their init script maxed out its retry count in a second of “host not found” errors and then exited permanently.

When you leave things like retries, logging, or dropping permissions up to each daemon you end up with a hodgepodge of incomplete implementations and things like error handling are where corners tend to get cut the most since the situations are infrequent and often hard to simulate.
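
For anyone curious what the missing piece in that init script looks like, here is a minimal sketch of retrying with a delay and exponential backoff. It assumes a DNS lookup is the operation that fails; the function name, parameters, and defaults are made up for illustration and do not come from any daemon mentioned in this thread.

```python
import socket
import time


def resolve_with_backoff(host, attempts=5, base_delay=1.0, max_delay=30.0,
                         resolve=socket.gethostbyname, sleep=time.sleep):
    """Resolve `host`, retrying failures with exponential backoff.

    `resolve` and `sleep` are injectable so the behaviour can be tested
    without a network or real waiting. All names and defaults here are
    hypothetical, chosen only for this sketch.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return resolve(host)
        except OSError:
            # socket.gaierror (a DNS failure such as "host not found") is a
            # subclass of OSError, so resolution errors are retried too.
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)  # wait before retrying instead of spinning
            delay = min(delay * 2, max_delay)  # double the wait, capped
```

With these defaults the waits between attempts are 1, 2, 4, and 8 seconds, so a short outage gets ridden out instead of burning through the retry count in a second.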

> It’s prolonged downtime, that’s what it is, and it may make the system unusable

IMHO, that's how I feel about systemd --- if a startup task is stalled, the system is non-interactive, unless something has changed since I last experienced this issue.

If some other operator access has started, you can use that, but the console is useless.

> It's my responsibility to ensure the service is operating correctly

Really, it's only on you to fix it if it does not work, which in my experience is far less likely with systemd.

> Really, it's only on you to fix it if it does not work, which in my experience is far less likely with systemd.

That's just lazy if you're waiting for systemd to report a failure to you. It's on you to maintain the system and ensure it's all operating correctly, including checking that the daemon starts and stops correctly when run manually.

Makes sense that on machines I've inherited, such checks are not done.

I've been running Arch continuously without a single complete reinstall for a decade now. I don't think it's fair to claim I wouldn't know how to maintain my systems correctly.

I'm not deliberately claiming you cannot. But working with teams and systems where people wear the badge of "I'm a DevOp who can SysAdmin" and try to prove themselves in such a manner makes me sigh.

If you're unable to compile your own kernel, you don't pass in my book; it's a rite of passage.

> If you're unable to compile your own kernel, you don't pass in my book; it's a rite of passage.

Pulling the PKGBUILD (or similar for other distros) and throwing in a couple of patches before `makepkg`-ing it is the easy part. Of course you could do it manually, but then what's even the point besides bragging rights? Very few people have a reason to build a completely custom system LFS style, and those that do will.

At least let me cancel it from the console easily: control-c and continue. Waiting on retries that are never going to succeed is incredibly annoying.

Shouldn’t you need a password to log in? Seems bad to just let some rando interrupt your boot process. It’s pretty easy to modify the timeouts that bother you.

The timeout issue is often a temporary issue after an upgrade. I need to get into the machine to debug and fix it, but I'm delayed waiting for the timeout.

If they have access to the physical console, they can already permanently interrupt your boot process with Alt-SysRq, Ctrl-Alt-Delete, and likely several other methods (the power switch, or pulling the plug). I'd say being able to gracefully cancel a timeout/retry would be the least of your worries.

If "some rando" has access to the console keyboard while the system is being booted and some service is timing out... You've got bigger problems than the fact they can cancel the start-up of that service.