by darkhelmet 1064 days ago
There is another problem that wasn't covered in the article. The 10+ years of stability leads to behaviors and outcomes that remind me of the long-lived SSL certificate problem. Updating is done so infrequently that the "how?" is forgotten. As the 10 year support limit approaches, most of the old team members who did it last time are gone, tech debt is through the roof, few people know where everything is or how to build it, and so on. Enterprise Linux "stability" enables all sorts of bad behavior if your company is inclined that way.

LetsEncrypt did us a huge favor by forcing automation vs having the guy who knows how to update the SSL certs every 4.9 years and left 6 months ago. I'd like to see the RHEL stability model go away too and force people to complete their automation and solve the problems of being able to rebuild on demand - and actually doing it.

(I know, most HN folk are well disciplined but there are a lot of corporate cultures that are not.)

20 comments

Think of all the places Linux runs. Planes, trains, and automobiles. Medical equipment. So many other places. Many places that don't have readily available network access. Yet, many "enterprises" need support here.

If a medical device or train works but needs support for years and years, should someone be constantly updating Linux? What about the software that runs on Linux and is tested there?

Considering just the modern cloud environment really limits where enterprise Linux runs and is useful. And, where there are calls for really long support contracts.

Absolutely spot on here, there's no reason for the obsession HN (and the wider internet) has with upgrades in these kinds of environments. These are not environments where you can tolerate unplanned downtime, this isn't a silly web app running in us-east.
Upgrades are a necessity if development goes on.

And HN is full of developers.

On a related note, the new EU Cyber Resilience Act will make upgrades a necessity even in the most constrained environments like pacemakers.

It's easier to do an update with a single security fix than an update that rolls in a ton of new functionality and ends up breaking your device. I've seen this time and time again with OS and dependency updates.
> If a medical device or train works but needs support for years and years, should someone be constantly updating Linux? What about the software that runs on Linux and is tested there?

Yes. If you have embedded software in the field and it is running on hardware that has not reached its EOL, then you absolutely should be fixing bugs and vulnerabilities, doubly so when that hardware is attached to any kind of network, and triply so when the software talks to some kind of cloud services.

For most products, the customer should have the power to decide when that hardware reaches EOL. In other words, it should be illegal (and severely punished) to disable or downgrade devices remotely, whether by abdicating the responsibility to maintain their software or by shutting down network services that those devices require to operate fully.

At the very least, that would prevent the proliferation of pervasive networking features that have no business communicating with anyone except their owners.

> Yes. If you have embedded software in the field and it is running on hardware that has not reached its EOL, then you absolutely should be fixing bugs and vulnerabilities

Ok, I'll be the unpopular person here.

Some bugs, even security ones: are ok.

I know that a very significant amount of software is mature these days, but sometimes upgrading causes different bugs, which are either harder to diagnose or even potentially more deadly.

I work in gamedev though, releases are tradeoffs with which bugs we accept.

It is very frequently the case that fixing one bug will incur several other bugs, it's just a case of understanding if they're worse or not.

For example: I don't care that my TV will have a 1/100 crash when launching the settings. I will just launch settings again.

Coincidentally: having everything constantly updating and internet connected is counter-intuitive. I used to spend entire evenings waiting for my PS4 to install new software on itself, but the PS2 (which contained many bugs!) worked much more often.

A known bug beats an unknown update in a whole lot of enterprise use cases. This drives devs insane but it is true.
Yes, there's substantial value in being able to have a specific bugfix and not have to upgrade an application and a ton of dependencies.
Or not even a fix. A known bug can be mitigated or worked around. Predictable is better.
It’s a limitation in SemVer too. Fixing a bug in a point release (e.g. keeping the api the same) could still result in undesired changes.
That brings flashbacks of my PS3. I had one, but I ended up not playing it because every time I switched it on there was a 2 hour mandatory download.

When you only have a couple of hours free time, spending it updating firmware means that I just don't update or use it.

> If you have embedded software in the field and it is running on hardware that has not reached its EOL, then you absolutely should be fixing bugs and vulnerabilities, doubly so when that hardware is attached to any kind of network, and triply so when the software talks to some kind of cloud services.

Talk to Qualcomm and NXP and MediaTek and Broadcom and STMicroelectronics and....

Seriously, most of this is completely out of the product engineers' control. We can't update the hardware because our vendors (ALL of our vendors) control the kernel selection, almost never upstream, and never keep the board support packages up to date. Never.

IME the only exceptions to that rule are RaspberryPi (kinda sorta), AMD, and Intel. I only recently managed to finally extricate all my projects from Linux 2.6 which I considered a minor miracle. In 2023.

One more exception is NXP i.MX 8M Quad, which is used in a Librem 5 phone.
> For most products, the customer should have the power to decide when that hardware reaches EOL. In other words, it should be illegal (and severely punished) to disable or downgrade devices remotely, whether by abdicating the responsibility to maintain their software or by shutting down network services that those devices require to operate fully.

I mostly agree, but the severe punishment for not keeping software or services updated and running should be to release as much of the software as you own and a full hardware spec, well before you turn off the lights. People have to be allowed to get out of a business.

Doesn't matter if the manufacturer provides updates for embedded software if the customer doesn't want to install those updates because they'd have to incur the cost and downtime of retesting and updating their upstream software to address any incompatibilities caused by those fixes. It's more common than you'd think.
> whether by abdicating the responsibility to maintain their software

If you bought a product and the vendor said it will be supported for X years with updates — that’s all you get — X years. If you want to keep using a product for X+1 years, that’s fine, but that’s on you. I agree with your sentiment that hardware shouldn’t be disabled, but I don’t expect security updates past EOL. You can’t expect a business to keep supporting a product longer than they originally said they would (without compensation).

But these are enterprise products… support contracts last a long time. More than enough time to migrate (or past the time when you should be migrating). That’s part of why the long term support exists in the first place.

> Think of all the places Linux runs. "Planes, trains, and automobiles". Medical equipment.

Where is your other embedded equipment???

I love that movie.

You can throw away an old router and put a completely new one in place, and it will work. A pretty easy way to upgrade.

Usually you cannot do the same with a flight computer or a medical device controller.

That's already handled in the automotive world. Either wait until entering the garage with wifi, or have a mobile modem installed.
Years back (late 1990s) I worked with some mainframe people. They bragged that everything is hot fixable on the fly, so they can apply security updates or replace broken hardware without rebooting. Then they admitted they schedule a reboot every 6 months anyway. Turns out the redundant backup power supply once failed at the same time, one hot patch had not been applied to the startup scripts, and it took a week to figure out what was missing so the system would boot again. By rebooting every 6 months they remember everything and so can get the system back up.

I probably have some details wrong in the story above. I worked with those people, but never on the mainframe. I think the point stands though, if you don't do something often it can't be done.

This doesn't surprise me. Mainframes aren't just about never failing; they have a whole culture, including ops, around providing availability in ways that actually work.
I've also heard of teams that shut down the mainframe for an hour during a time change. It's an easy way to avoid application issues for a small amount of downtime.
We used to do this on several hpux servers at $dayjob. However 95% of those servers have long since been decommissioned, and the remaining server didn't actually need it to begin with. (It was really anything that had an oddball database that needed it)
If it runs Unix it isn't a mainframe.

Only half joking.

I would say it's 90% not joking.

A mainframe has hardware different enough to require a different approach to the OS.

Of course a modern variant of System 390 happily runs tons of Linux VMs.

Starting by having systems programming languages that actually have proper strings, arrays and bounds checking.
What are some of those languages? I'm curious to learn more.
Several PL/I dialects e.g. PL/S and PL.8, BLISS, Modula-2, ESPOL/NEWP for example.

Also Pascal and BASIC compilers with several extensions, e.g. VMS Pascal and VMS BASIC.

BLISS is a typeless word-oriented language like BCPL, so I am surprised to see it in this list. Also I am amused to hear VAX/VMS described as a mainframe operating system.
> I'd like to see the RHEL stability model go away too and force people to complete their automation and solve the problems of being able to rebuild on demand - and actually doing it.

Whenever there's a new distribution release it invariably breaks a bunch of things with the automation and you spend more time massaging your playbook so it works again than it would have taken to do it by hand

systemd threw the biggest wrench, by far, in my automation workflows (this was before containers came to dominate a lot of the landscape, so everything was managed with init scripts). I like it now, but it also broke things quite frequently in the early days, and there was a looong period of time when you had to shim software to work with systemd.

But even still, things like snaps, the way Debian handles system Python, and various little changes that have an outsize effect on automated deployment do cause a good amount of churn with automation.

Yup, enterprise Linux insulates you from unneeded change (in the business context). For most companies, systemd will have no impact on the bottom line vs sysvinit vs whatever.

However, paying an extra engineer to sort through all the changes possibly will.

On the other hand, there are some interesting trends like monokernels and minimal OS images that leverage services running off-machine instead of expecting so many local services, which removes some of the complexity/volatility (DNS, SMTP, federated login).

>things like snaps, the way Debian handles system Python

Both these things should not be an issue for anyone, just one or the other.

> Whenever there's a new distribution release it invariably breaks a bunch of things with the automation and you spend more time massaging your playbook so it works again than it would have taken to do it by hand

The rolling-release life is that things break constantly, during each week’s upgrade, but only a little bit at a time (and hopefully in staging). I don’t know if this is better for system administration, necessarily, but if you’re used to a stable-release dynamic of heavy discrete breakage and piles of backported patches, then you might be imagining the same scale of breakage every upgrade, which isn’t the experience at all. So don’t discard the rolling-release option because of this preconception.

The difference is: If there is a weekly breakage on a weekly update, dealing with it is just part of the process and timed in.

If you only update every few years, each update becomes a full project distracting from and conflicting with other projects.

> The difference is: If there is a weekly breakage on a weekly update, dealing with it is just part of the process and timed in.

That entirely depends on your operations model. There's a difference between, say, a nuclear power plant and a colo web hosting shop. With the latter, sure, no problem risking "minor" weekly breakage. With the former, I'd much rather have scheduled, heavily tested and carefully monitored maintenance windows.

And HN tends to underestimate the number of places like the former exist. Backbones of global finance and telecom, industrial facilities of all kinds, etc.

A nuclear power plant control software hopefully isn't connected to external systems, but fully isolated.

And yes, upgrading that is a full blown project.

It is very different from a "living" software environment with ongoing development processes.

So how about a more down-to-earth example. Medical imaging.

Changes to graphics drivers, can, do, and have impacted how things like MRI results get rendered by software. It's going to have at least some networking with the rest of the hospital and difficult to completely airgap, but at the same time you cannot update it willy nilly with the latest and greatest uncertified drivers.

That's precisely where "enterprise" distros are sometimes necessary.

It is the countless places that exist at some level between your two extremes of the nuclear power plant and the colo web host that need Enterprise Linux.
My point when I mentioned rolling-release was that I’m very unsure whether they need Enterprise Linux and a bout of desperate firefighting every several years or a rolling-release distro and a small but respected team with a staging environment and a steady supply of handheld fire extinguishers.

I could be convinced the former were the answer in many cases, but I’ve never seen the argument for that move beyond a bare proclamation like you’ve just made. It may also be that the argument for this is so mired in the particulars of a situation that it essentially can’t be articulated, but for me that’s mutually exclusive with it applying to broad, vaguely defined classes of deployments, as you’ve just done. (A complex argument is bound to produce an intricately shaped class.)

My (admittedly theoretical) fear in my initial comment was that the LTS people read about kernel updates every month, think of the breakage their LTS encounters on every kernel update, and nope out. Yet without the humongous pile of backported patches the LTS requires kernel updates are the most benign thing in the world if you can afford the reboots—a decade of them with literally not a single issue can and does happen.

>> I'd like to see the RHEL stability model go away too and force people to complete their automation and solve the problems of being able to rebuild on demand - and actually doing it.

So how could this be accomplished?

A Nix-OS style approach coupled with an immutable OS core?

It is much harder to offer stability guarantees than to just publish updates in a rolling release fashion. And yet big organizations pay big money for that stability or the support to poke an enterprise software provider to get stuff working for their needs.

It won't be. Not until comprehensive integration testing comes as standard.

If you don't have comprehensive integration tests it's less risky just to not upgrade unless you really need to.

>> If you don't have comprehensive integration tests it's less risky just to not upgrade unless you really need to.

Exactly. Most of us do not have sqlite-level (https://www.sqlite.org/testing.html) testing so we just do the best we can with the resources we have.

This tends to make most "enterprise" shops risk-averse and the motto becomes "if it's not broke, don't fix it".

ArchLinux-style rather than NixOS style. Just roll the updates when they are ready into your very own test, int, acc, and finally prod.
>> ArchLinux-style rather than NixOS style. Just roll the updates when they are ready into your very own test, int, acc, and finally prod.

The issue is that a rolling-release approach does not have stability guarantees and forces everything to be upgrading all the time.

This does not work very well if you have specialized hardware or scientific equipment. If the drivers for your lab equipment work with a given release of an enterprise linux, you can't just jump on the next release until you have working drivers ready.

The same is true if you are working with some enterprise software which is only certified to work with a given release of an enterprise linux. Would you really want to run business critical software on a version of the operating system which is not (yet) supported by the vendor?

All those hours hunting for the reasons why something suddenly stopped working every two or three weeks need to be paid. So maintenance cost for Linux servers would either skyrocket or no updates would ever be done for years. There are very good reasons why rolling releases in infrastructure are basically a no-go.
Didn't Google just switch to rolling releases?

Also shout-out to Arch ... I've been using it since ... forever and never really had an issue in update.

Same here, I ran Arch on Hetzner Cloud and find it superior in many subtle ways.

Rolling updates mean that I have to carefully choose what to install in order to keep the maintenance costs down. This has the side effect of reducing the attack surface.

I also manually review updates, which means that I keep up with the news in OS land.

I have to reboot once in a while because of updates, which means that I test resilience of my infrastructure.

By comparison, the RedHat stack at work feels crippled and ancient.

You need a RedHat account (read: subscription) for pretty much everything, even the most basic documentation or downloads, and yet my bugs in their bugzilla linger for months with no one even trying to reproduce them, let alone fix them.

Every single time I tried using arch, I had nothing but problems with updates. I wasted a lot of time hunting down documentation to things that broke, like mailing, logging, the DE, bluetooth, you name it. Changes that get taken care of by default in other distros. I had some very nasty surprises while using arch. My stable Ubuntu or Debian installs didn't even have a single glitch in the exact same timeframe.
How about Qubes OS style, where everything runs in a VM?
>> How about Qubes OS style, where everything runs in a VM?

How does Qubes OS work with drivers for specialized hardware such as scientific lab equipment?

Depends, I think. I remember you being able to finagle a passthrough of devices, the underlying software can do that with little issue and once passed through it shouldn't be an issue, but I vaguely remember there being some notion of that being dissuaded. Mostly because of the increased attack surface, I think. Though that was a few years ago now.

Qubes OS doesn't really solve much in regards to stability and work required to update in a professional setting though, I'd say.

> but I vaguely remember there being some notion of that being dissuaded. Mostly because of the increased attack surface, I think.

Using a GPU passthrough indeed decreases the security, but it is still much more secure than anything else. More details: https://groups.google.com/group/qubes-devel/browse_frm/threa...

> Qubes OS doesn't really solve much in regards to stability and work required to update in a professional setting

Couldn't disagree more. All software runs in VMs. The Admin VM never goes to the Internet or runs anything. Therefore it's less necessary to update and reboot it: https://www.qubes-os.org/doc/supported-releases/#note-on-dom...

All VMs are easy to backup/restore in a few clicks; cloning for testing and upgrading are amazingly smooth. All this with a great GUI.

Passing through a device is more secure than bare-metal, yes, but I meant it as the Qubes OS project themselves dissuading the notion. And doing so decreases security for the whole system, which is why they advise against it, because you've now given a potential bridge back to the main system. Not that there have been many exploits for that yet, but if such systems were more common, there would be.

Regardless, even with the ease of virtual machines, Qubes OS doesn't really solve problems involved in professional management, nor should it be considered for that given the overhead of the system. It's a neat system, and for general purpose use it's pretty cool, but for stability and work required to update a system, it really doesn't.

Sure, it makes general use-cases pretty easy to update, but those aren't really much of a problem if you've set up a PXE server and have a base image together with default configuration. The issues occur when you're updating servers, which is what I was talking about, because the end-user isn't that important in this context.

Whether you have it running in a hypervisor or bare metal, it all comes back to properly configured backups. Qubes OS doesn't solve this, nor is it meant to. It increases security at the cost of convenience and complexity. Nor does it solve stability in a professional environment, because while it does give you an isolated OS per application or stack of applications, you've now increased your maintenance surface. Servers need to be updated, within a reasonable time frame of ones being provided, and general computing often requires more of that too.

And at the end of the day, while I do like Qubes OS and do like virtual machines, they're not the be all, end all, in regards to security. Exploits exist, and as with all things, the more common they become, the more will be made.

I do still hope for a system like Qubes OS in the future, just not Qubes OS.

Usually devices are connected to specific VMs and the drivers are installed inside them. VMs can run Linux or Windows. See this: https://www.qubes-os.org/doc/how-to-use-devices/
Likewise, it's been suggested that the 19.6-year (1024-week) GPS epoch is pessimal. Rollover is infrequent enough to be ignored, but frequent enough to actually happen and cause problems.

Folks who know such things better than I do, have suggested that it would've been far better at like a 64-week rollover (or just chop it to 52 and leave part of the code space unused), that way everyone would have to have a plan for it. Nobody could claim they don't expect their hardware to be in use 64 weeks in the future therefore they can ignore rollover.
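
To make the rollover arithmetic concrete, here is a minimal sketch in Python. The 10-bit week counter and the 1980-01-06 epoch are the legacy GPS values; the function names and the "pivot date" trick for picking the right era are illustrative assumptions, not how any particular receiver does it.

```python
from datetime import date, timedelta

GPS_EPOCH = date(1980, 1, 6)   # start of GPS week 0
ROLLOVER_WEEKS = 1024          # a 10-bit week counter wraps after 1024 weeks


def rollover_period_years() -> float:
    """Length of one rollover era in years."""
    return ROLLOVER_WEEKS * 7 / 365.25


def week_to_date(full_week: int) -> date:
    """Calendar date on which a (non-truncated) GPS week starts."""
    return GPS_EPOCH + timedelta(weeks=full_week)


def resolve_week(truncated_week: int, pivot: date) -> int:
    """Guess the full week number from the 10-bit broadcast value.

    Assumption for this sketch: the true date is never earlier than
    `pivot` (e.g. the firmware build date), so we pick the first
    matching week at or after the pivot.
    """
    pivot_week = (pivot - GPS_EPOCH).days // 7
    era_start = pivot_week - (pivot_week % ROLLOVER_WEEKS)
    full_week = era_start + truncated_week
    if full_week < pivot_week:
        full_week += ROLLOVER_WEEKS  # the counter has wrapped since the pivot
    return full_week
```

The catch is visible in `resolve_week`: the pivot only buys one more era, so hardware that outlives its pivot by 1024 weeks is wrong again, and nobody will remember why.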

Funnily enough, I had the Unix epoch time question come up in discussion with a customer (who makes very long-lived pseudo-embedded systems) last week.
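
For reference, and assuming the question was the usual one about a signed 32-bit `time_t` (the comment doesn't say), the arithmetic can be sketched like this. The helper names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit counter can hold


def to_utc(seconds: int) -> datetime:
    """Convert seconds since the Unix epoch to a UTC datetime."""
    return UNIX_EPOCH + timedelta(seconds=seconds)


def wrap_int32(seconds: int) -> int:
    """Simulate what a signed 32-bit counter would store for `seconds`."""
    return (seconds + 2**31) % 2**32 - 2**31


last_good = to_utc(INT32_MAX)                     # 2038-01-19 03:14:07 UTC
after_wrap = to_utc(wrap_int32(INT32_MAX + 1))    # 1901-12-13 20:45:52 UTC
```

One second past the limit, a 32-bit counter lands in 1901, which is a real concern for a device sold today with a 15+ year service life.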
That is a good point, however I've not heard of too many cases where organizations intentionally skip RHEL releases. Systems that are being actively developed do regularly upgrade through each RHEL release, and the 10 year support just lets them be lazy about how quickly they do so. The only systems I see intentionally riding out the 10+ year support are deprecated systems that are already announced to sunset by the time RHEL support ends.

The five year reign of RHEL7 was too long and did result in the very issues you bring up, but the ~3 year duration of RHEL 5, 6 & 8 was short enough to avoid problems due to attrition in enterprise settings (unlike startups which have higher turnover, and not counting bus factors of one - no release cycle can solve that).

And like others have pointed out, automation doesn't help as much when moving between releases. We have everything configuration controlled with kickstart and ansible and/or docker, and it is great for reproducibility within a release cycle, but it doesn't save much time or knowledge between releases. And Ubuntu is even worse in that regard despite having a shorter release cycle.

It's one of the things that ground Yahoo to a halt. We spent years migrating from RHEL-4 to 6, then RHEL-6 to RHEL-7, and by the time the projects were pretty much complete, the next sunset was approaching. My cynicism comes from seeing the bad things that "Enterprise Linux" enabled there.

Admittedly, Yahoo was an extreme case. It never really solved the building problem - the culture from the early days was to compile, ship and forget. Once a RHEL-6 package was pushed to our dist/yinst system (packages), it would never be rebuilt unless 1) it was necessary, or 2) it was time to try and figure out how to build it on RHEL-7.

A lot of effort was spent in the later years to try and address this (by burning the old tech stack to the ground), but the culture was pervasive for the longest time. If 10-year-RHEL didn't exist we would have been forced to address the building processes.

If it's hard or error prone, then do it frequently until you get the process nailed down.

Major life lesson -- practice makes perfect.

Practice makes something permanent. Whether that is perpetual perfection or perpetual mediocrity depends on the person.

The average person could practice violin for 500 years and never be invited to play Carnegie Hall.

If the average person were in an exceptionally good environment, they would likely get (much) better over time.
Oh my sweet summer child. There are plenty of enterprises still on RHEL 6, paying extra to keep patches coming. I bet there are still some on RHEL 5. Large, global companies. And of course there are many environments where software and hardware stay frozen for years. Telecoms, factory floors, automotive, finance, disconnected edge devices of all types... they test the absolute crap out of those systems, and pay for every certification available, and then leave them running for a decade.

The ability to stay abreast of major version updates is a wonderful attribute of many environments, but definitely far from all.

In 2017 I led a (painful) project to migrate from RHEL5 to Ubuntu 16. Since then, it has been pretty easy to go to 18 then 20, soon 22. The previous migration was in 2010 and was from RedHat 6 (not RHEL) to RHEL5. These projects were for projects that are "ghost deprecated" in that not much time is spent in talking about them but they are critically important to the business, as profit centers and cost drivers, but not the new flashy stuff. So we saved $10s of millions of dollars in hardware savings mostly due to the improved schedulers in 4.0.x series of kernels compared to 2.6.28 kernel. The same image became the basis of the containerized version that was rolled out a bit later.
22 might be difficult depending on what you use from Ubuntu's repositories; they are converting apt apps to snaps.
>That is a good point, however I've not heard of too many cases where organizations intentionally skip RHEL releases. Systems that are being actively developed do regularly upgrade through each RHEL release, and the 10 year support just lets them be lazy about how quickly they do so.

Industry specific - but finance world, we still had straggling RHEL5 machines up until a year or two ago and still have a bunch of RHEL6 machines and have basically NO RHEL9. The vast majority of the machines are all sitting on RHEL7/8.

I feel somewhat called out. By the time we finished migrating from centos6 to centos8, centos8 was being shot in the face. Talk about "fun times".
>I've not heard of too many cases where organizations intentionally skip RHEL releases.

I assume you mean major releases. Less common than minor releases but, especially for air-gapped equipment, it's not that unusual.

Don't you end up with the same problem with the automation that has been running fine for 5 years, then suddenly breaks? And the person that set it up is either gone, or has no clue how they did it 5 years ago.
Recently saw a (thankfully not mission-critical) old k8s cluster fall down with absurd incompatibilities between node versions, cluster versions, and cert-manager versions - all of which only support upgrades one version at a time. Even infrastructure-as-code doesn’t save you if you need to upgrade something but don’t have the time and expertise (and esoteric changelog knowledge!) to reliably upgrade everything else.
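
The "one version at a time" constraint is easy to sketch. The snippet below is a hypothetical helper (not part of kubectl, kubeadm, or cert-manager) that just enumerates the intermediate minor versions you would have to step through, assuming the component only supports upgrading one minor version at a time as described above.

```python
from typing import List


def upgrade_path(current: str, target: str) -> List[str]:
    """List each minor version to pass through, one step at a time.

    Hypothetical helper: assumes versions look like "MAJOR.MINOR[.PATCH]"
    and that skipping a minor version is not supported.
    """
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("downgrades are not handled")
    # Every minor between current (exclusive) and target (inclusive) is its own upgrade.
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]
```

A cluster left at 1.19 that needs to reach 1.22 is three separate upgrades, and each step has to be checked against the supported versions of everything else running on it. That multiplication is where the time and changelog knowledge go.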
Before LE almost no one automated SSL cert refresh. Depending on your SSL cert vendor you couldn't automate things even if you wanted to. It's not that the automation ran fine for five years, it's that you'd be lucky if the manual process last done 5 years ago was even documented.

SSLMate is about as old as LE, they both started around the same time.

The idea is that you deploy from scratch all the infrastructure every 6 months, first to testing and then to production.
All you damn kids work in a different industry than I do.
Every 6 months? That seems like a pretty long window for tribal knowledge to get lost. Is 6 months arbitrary or is there some reasoning behind that cadence?
Arguably, tribal knowledge and the dependence on it need to be managed as much as anything else.
That's an amazing amount of effort expended on something that provides exactly zero revenue. I understand the concept, but I've never been fortunate enough to work in a business where that was practical.
LE by default successfully ran 2 months prior. 2 months and 5 years are two completely different worlds in terms of bit rot. That and there are many generic tutorials and scripts and knowledgeable devs for configuring LE fresh.
> complete their automation

There are too many places where the current guard is going to have to die off before they even _start_ automation. So we're looking at 20-30 years.

LTS is the actual sane solution for these places, despite how utterly insane it is.

Some time ago I was listening to a guy talking about (operational) risk management in the financial industry. One of his main points was that the systems in financial organizations should not be completely bug-free and automated. Because when something eventually happens, if there is no one who has had to fix issues in the systems regularly, there is nobody around who can fix the system efficiently.

An example of an argument that belongs to a weird class of arguments you want to agree and disagree with at the same time.

I wonder if there would be a market for an enterprise-grade server microkernel OS. It's not the 90s anymore - Nintendo and QNX are shipping tens of millions of microkernel installs every year; and hardware is fast enough that choosing correctness and security over speed is a valid tradeoff. Maybe if I win the lottery...
These things tend to trade 200% performance for 10% security, though. That's not a tradeoff I am comfortable with in anything like all situations.
Not necessarily if you build them right. Nintendo’s Switch is a true microkernel and, if it cost 200% performance, there’s no way it would be viable on a 2015 Tegra X1. The 200% thing is kind of a myth that doesn’t apply to modern practice - now it’s more like 10%.

As for 10% security - it’s more than 10%. Take my same example, the Switch. No bugs have been found to launch unapproved software in the last 4 years. There’s always the Secure Boot bug by NVIDIA in earlier consoles, but not even a WebKit bug will get you homebrew on a Switch. Kind of a big deal…

Another example of this would be Microsoft’s experiments with what would happen if an OS was built with all apps running in managed code - no compiled apps. Performance cost? They got it down to just 7% (though, admittedly, Midori never shipped, but it did host Bing in a few countries for a few years.)

Kaspersky recently developed their own proprietary microkernel OS. AFAIK they target it at IoT, but a kernel is a kernel, so it could probably be used on ordinary servers as well.

Main issue is drivers, of course. It's hard to beat Linux. It contains open source drivers and server vendors usually target Linux and Windows with their driver efforts.

> LetsEncrypt did us a huge favor by forcing automation vs having the guy who knows how to update the SSL certs every 4.9 years and left 6 months ago.

Not in my experience. There's still a guy who goes around and updates (manually) all the LetsEncrypt certificates every year.

> Not in my experience. There's still a guy who goes around and updates (manually) all the LetsEncrypt certificates every year.

LetsEncrypt certificates don't last for one year, they only last for 90 days, no exceptions. You may be thinking about something different.
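To make the 90-day arithmetic concrete, here is a minimal Python sketch that takes a certificate's notAfter timestamp, computes the days remaining, and decides whether a renewal is due. The 30-day threshold and the function names are assumptions chosen for illustration, not settings taken from this thread.

```python
import ssl
from datetime import datetime, timezone

RENEW_WITHIN_DAYS = 30  # assumed threshold for this sketch


def days_remaining(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter time.

    `not_after` uses the text format returned by
    ssl.SSLSocket.getpeercert()["notAfter"], e.g. "Apr 10 00:00:00 2024 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def renewal_due(not_after: str, now: datetime,
                threshold_days: int = RENEW_WITHIN_DAYS) -> bool:
    """True once the certificate is inside the renewal window."""
    return days_remaining(not_after, now) <= threshold_days
```

With a 90-day lifetime and a 30-day window, a scheduled job that calls something like `renewal_due` every day would trigger roughly every 60 days, which is why nobody should need to remember to do it by hand.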

They may be talking about the certbot software itself, which does the updating of certs.
We truly live in amazing times! We have language models that sound human and internet from space, but never bothered to schedule that script for updating TLS certs. Or put it in version control for that matter.

Sounds like my org :)

Shouldn't he be going around every couple of months?
Having worked in both kinds of cultures, I tend to agree. Keeping up is ultimately less pain than trying to upgrade things in huge chunks.

But it can be really hard to change the culture at a place that has a long history of "ain't broke, don't fix it" engineering.

Less pain yes but more efficient? I'm not sure.

The places that don't do constant upgrades also don't usually have teams looking after that. If they time it right they can get by with fewer people.

Of course it's less reliable not having as much active knowledge but I do think it can be cheaper if nothing goes wrong.

Yes, that can be the tradeoff, and a reasonable one in the right circumstances. Some projects are like that where there is really no team dedicated to anything more than keeping the lights on.

In my comment, I was thinking of well staffed (or at least close-enough to well staffed) teams making deliberate decisions to defer.

Long term support allows bad behavior but I think it's still useful to reduce the amount of feature/breaking changes happening to software.

Those problems can also be mitigated with mandatory environment rebuilds which is trivial for a lot of setups with infrastructure as code.

At the extreme end, you have Kubernetes/CNCF where 6 months go by and you're many versions behind with a huge changelog of breaking changes you have to fix first. Stable APIs and stable ABIs are very useful here (which enterprise Linux provides).

In my experience updates are not forgotten. Even automated.

Upgrades, however, are a different story. Major version changes require a ton of testing and manual massaging. This is why enterprises like to do them infrequently. For the systems that are easier, you can still choose to follow the releases quickly.

Because security patches are backported, staying on an old major version is usually not a real issue.

> I'd like to see the RHEL stability model go away too and force people to complete their automation and solve the problems of being able to rebuilding on demand - and actually doing it.

In this model, what happens when the next Python2->Python3 breaking change comes along?

Using whatever Python your distribution needed is bad practice. Own your application environment, there are plenty of ways to do this, such as Nix and Docker, which make your Python environments reproducible across systems.
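As a much smaller stand-in for Nix or Docker, a Python application can at least refuse to start on the distribution's interpreter. The sketch below uses the standard `sys.prefix` versus `sys.base_prefix` comparison to detect a virtual environment. The function names are hypothetical, and a venv check is a deliberately simpler technique than the reproducible builds mentioned above.

```python
import sys


def in_virtual_env() -> bool:
    """True when running inside a venv rather than the base interpreter."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the interpreter it was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def require_owned_environment() -> None:
    """Fail fast if the application was started with the system Python."""
    if not in_virtual_env():
        raise RuntimeError(
            f"refusing to run on system interpreter at {sys.prefix}; "
            "start the application from its own environment"
        )
```

Calling `require_owned_environment()` at startup turns "we accidentally depend on the distro's Python" from a surprise at the next OS upgrade into an immediate error.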

Also, Enterprise Linux is one of the reasons (definitely not the only one) that the migration took such a long time. Too many enterprise shops stuck with Python 2 because it was the lazy thing to do. The tech debt grows every year you don't move with the ecosystem.

>Using whatever Python your distribution needed is bad practice. Own your application environment, there are plenty of ways to do this, such as Nix and Docker, which make your Python environments reproducible across systems.

How far down does "own your application environment" extend? How about libc? What is the role of the underlying OS?

>> How far down does "own your application environment" extend?

It depends on the needs of your application.

>> How about libc?

If you need to make sure the underlying libc has what you need, you must either bring your own libc or have sufficient feature test macros and adapters to account for possible differences.
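In C this is done with feature test macros at compile time. A rough runtime analogue in Python, shown purely as an illustration, is to test whether the platform exposes a call and to fall back through a small adapter when it does not. The function name `random_bytes` is hypothetical; `os.getrandom` is only present where the underlying platform provides it.

```python
import os


def random_bytes(n: int) -> bytes:
    """Return n random bytes, preferring getrandom(2) when it is exposed."""
    if hasattr(os, "getrandom"):  # feature test: is the call available here?
        try:
            data = os.getrandom(n)
            # getrandom may return fewer bytes than requested.
            if len(data) == n:
                return data
        except OSError:
            pass  # e.g. the running kernel lacks the syscall
    # Portable fallback that exists on every supported platform.
    return os.urandom(n)
```

Callers depend only on the adapter, so differences in the underlying libc or kernel stay confined to one function.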

>> What is the role of the underlying OS?

It depends on what the application requires. What operating system features, if any, do you require? Do you have any timing or scheduling requirements that are sensitive for your application? Do you need real-time responsiveness?

How does the operating system handle failure scenarios? What guarantees, if any, does it make when hardware fails? Is it okay for your application to crash if a portion of the computer's memory or disk borks?

> What is the role of the underlying OS?

Ideally none, with scratch containers for applications and the bare minimum running under your orchestrator which becomes your main interface.

I have a customer with systems so old they weren’t at risk for heartbleed. He was excited about that.
Heh, same but for the Java Log4j vulnerability. "We haven't upgraded in 10 years, and it's secure from that!"
That was anybody who used Debian stable
> There is another problem that wasn't covered in the article. The 10+ years of stability leads to behaviors and outcomes that remind me of the long-lived SSL certificate problem. Updating is done so infrequently that the "how?" is forgotten. As the 10 year support limit approaches, most of the old team members who did it last time are gone, tech debt is through the roof, few people know where everything is or how to build it, and so on.

This is known as the out-of-the-loop performance problem. [1]

[1] https://en.wikipedia.org/wiki/Out-of-the-loop_performance_pr...

I agree but I don't think that's SUSE's or Red Hat's problem. If you deliver a solid and stable product humans will get complacent.
How is the problem of TLS certificates related to Red Hat Linux or Enterprise Linux? I think these are orthogonal problems.
I think it was an analogy. If you don't do a thing for a long time (updating SSL certificates, updating a Linux system to a major new version), the knowledge of how the systems were maintained/built gets lost. If you have an automated, repeatable process that moves with the times, it is more likely that the process is codified (either in documentation or in infrastructure as data) and easy to repeat.
If certificate renewal is automatic why do it at all ...
To ensure that those who possess the certificate still control the domain.

The main issue is that certificates should really be automated by every web server by default, at least for those with public IP addresses. There are servers like Caddy that have implemented it, but it should be a basic feature that just works without any additional configuration.