by mistahenry 2132 days ago
I’ve worked in a number of areas with basically no fail/delay SLAs. I think it’s naive to think “if you need a hot fix right now, you’re doing it wrong”... the number of times we needed to hot fix because of ourselves was very low. But when you’re in an integration-heavy environment and one of the many moving parts (outside of your control) breaks, well-thought-out “put the fire out” stopgaps on the server consistently save the day (and save the company money by not breaching the SLA).

That makes perfect sense and it's definitely true that sometimes the hotfix is not a bug in your code (which can be solved by a rollback) but instead having to patch a problem in a dependent system. But that seems orthogonal to the container issue. Shelling into a live server and changing something only works if you have the entire build toolchain on the production server which hasn't generally been the case in my experience. Even if you aren't using containers you still need to build artifacts and deploy them. It's just that you are deploying binary artifacts instead of containers. It doesn't seem like the container builds are the real long pole in that process.
Redeploy the older working version?
"outside your control" is key here. you're assuming a rollback would work. in many cases, some external system changes without your knowledge, and you're only seeing those changes on production.

I've got a client that has data feeds from multiple vendors. some are pulls, some are... "hey, we'll FTP this file to you". the file format has changed - unannounced - at least 3 times in the past... 15 months. Then something breaks on production, but you don't know what. You need to get on that machine and take a look.

"Redeploy the older working version" doesn't do anything except re-introduce more problems in these instances.

This is a good point. There are probably lots of people on HN working in cloud environments where your dependencies are actually organizationally within your control. If one of your dependencies makes a change that breaks you, you can escalate the problem and compel them to roll back the change. This is the luxury of building the entire world. My service depends on nothing that can't be escalated to my own VP, so "roll back to the old version [of whatever changed]" is a very satisfying answer, but it's not an option when your dependencies aren't obligated to keep you running.
I pity anyone whose system needs less than X downtime per month, but who depends on the constant availability of an external system that is down for more than X per month :)
> Then something breaks on production, but you don't know what. You need to get on that machine and take a look.

If that's a problem you find yourself having at all, much less regularly, you have a serious observability problem.
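
For the vendor-feed case described upthread, here is a rough sketch of what that kind of visibility could look like, assuming the feed is a CSV file with a known header row. The column names, logger name, and `check_feed_header` function are all made up for illustration; nothing here comes from the actual system being discussed.

```python
import csv
import io
import logging

logger = logging.getLogger("feed_validation")  # hypothetical logger name

# Hypothetical expected header for one vendor feed; not from the thread.
EXPECTED_COLUMNS = ["order_id", "sku", "quantity"]


def check_feed_header(text, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names for a CSV feed's header row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])  # an empty file yields an empty header
    missing = [c for c in expected if c not in header]
    unexpected = [c for c in header if c not in expected]
    if missing or unexpected:
        # Log the diff so the format change shows up in logs or alerts
        # instead of requiring someone to shell into the machine.
        logger.error(
            "feed header changed: missing=%s unexpected=%s", missing, unexpected
        )
    return missing, unexpected
```

Run on ingest, a check like this would at least tell you which feed changed and how, though it obviously doesn't fix the feed for you.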

Isn’t the larger issue that your production environment can be brought down by bad user input?
There's breakage outside "brought down". A system that's running but doesn't produce outputs because the input data changed can be "broken" and violating its SLA too. And that's not really something you can design around outside "we won't promise anything", but then you lose to competitors that do take it on themselves to react quickly enough with hotfixes.
How fast can you grow if you’re constantly putting out fires? It sounds like you are in a B2B world. Businesses always want more. Where are the sales people/customer service managers that can set realistic expectations on what the client requires vs. what they are expected to do? “logging into production and manually correcting stuff” can only go so far and doesn’t scale.