by TomBombadildoze 1849 days ago
Good gravy, this is a lot to unpack. It's an alarming story from the very beginning, and a cautionary tale of how tempting it is to do everything with Jenkins, even though it's an appropriate tool for absolutely nothing in the Year of our Lord 2021.

> As part of our automation setup, we continuously run integrity jobs to inspect our Jenkins nodes.

Why on earth would you self-host this in Jenkins? This is a monitoring and alerting problem.

> These jobs check system configurations and properties and look to see if any node is failing those checks.

What year is it? We've solved this with immutable infrastructure or system integrity monitoring. Or both.

> The checks automatically mark Jenkins nodes as offline when any of those checks fail and notifies our Mobile Build & Release team via a Slack message.

"Mark" offline? Why not just terminate it? And why do we care if build nodes come and go? These should be cattle, not pets. If they all die at once, that's bad. If they're cycling in and out, that's business as usual.
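To make the "all die at once" versus "cycling in and out" distinction concrete, here is a minimal sketch of a fleet-level check. It assumes the monitoring system can report how many build nodes are online; the function name and the 50% threshold are hypothetical, not anything from the article.

```python
def should_alert(online: int, total: int, min_online_fraction: float = 0.5) -> bool:
    """Alert on fleet capacity, not on individual nodes coming and going.

    min_online_fraction is a hypothetical threshold; tune it to the build load.
    """
    if total == 0:
        # No nodes registered at all: nothing can build, so this is always bad.
        return True
    return online / total < min_online_fraction


# One node out of ten cycling out is business as usual.
print(should_alert(online=9, total=10))  # False
# Every node gone at once is worth waking someone up for.
print(should_alert(online=0, total=10))  # True
```

A single node dropping out never trips this; only a loss of capacity does.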

> When our Jenkins UI stopped working, we noticed two things:

> 1. We had recently upgraded Jenkins and all its plugins to a newer version

Did they just now learn what an awful idea this is? All of this at once, really?

This isn't so much a Jenkins problem (though let's be clear, Jenkins is a problem) as it is a remedial engineering problem. The top takeaways should be "choose appropriate tools for the task at hand" and "don't make reckless decisions with brittle systems".

2 comments

> "Mark" offline? Why not just terminate it? And why do we care if build nodes come and go? These should be cattle, not pets. If they all die at once, that's bad. If they're cycling in and out, that's business as usual.

Given that they are for mobile builds, there might be some macOS nodes in there for iOS builds. These might be in-house machines they maintain -- or, if they use a cloud provider, there might be costs to just killing and spinning up nodes. For example, for EC2 Mac instances:

> EC2 Mac instances are available for purchase as Dedicated Hosts through On Demand and Savings Plans pricing models. Billing for EC2 Mac instances is per second with a 24-hour minimum allocation period to comply with the Apple macOS Software License Agreement.
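To put rough numbers on that minimum, here is a small sketch of the arithmetic. It assumes the 24-hour floor applies to each host allocation, and the hourly rate is a made-up placeholder, not a real AWS price.

```python
MIN_ALLOCATION_SECONDS = 24 * 60 * 60  # 24-hour minimum from the pricing text quoted above


def billed_seconds(used_seconds: int) -> int:
    """Seconds billed for one allocation: per-second billing with a 24-hour floor."""
    return max(used_seconds, MIN_ALLOCATION_SECONDS)


def allocation_cost(used_seconds: int, hourly_rate: float) -> float:
    """Cost of one allocation at the given hourly rate."""
    return billed_seconds(used_seconds) * hourly_rate / 3600


HYPOTHETICAL_HOURLY_RATE = 1.00  # placeholder for illustration only

# A host released after 10 minutes is still billed for the full 24 hours.
print(allocation_cost(10 * 60, HYPOTHETICAL_HOURLY_RATE))  # 24.0
```

Under those assumptions, terminating a bad node and allocating a fresh host costs a full day each time, which is why marking it offline and fixing it in place can be the cheaper move.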

If that's the case, just restart the failing nodes.

And of course it's not that simple; they still have to customize the workflow.

I think it's a frog boiling problem.

I start with building my code, then deploying it, then verifying the deployment, then a few smoke tests, then regression tests. Pretty soon all of those concepts are crowding in on the brainspace of monitoring.

It's just one more thing, why slow down to learn a new tool and convince people to use it?

These days it's getting easier for me to requisition a machine to run a dev tool on. That hasn't always been the case, and I'm sure it's not the case everywhere.