
by aeldidi 218 days ago

I'm becoming concerned with the rate at which major software systems seem to be failing as of late. For context, last year I only logged four outages that actually disrupted my work; this quarter alone I'm already on my fourth, all within the past few weeks. This is, of course, just an anecdote and not evidence of any wider trend (not to mention that I might not have even logged everything last year), but it was enough to nudge me into writing this today (helped by the fact that I suddenly had some downtime). Keep in mind, this isn't necessarily specific to this outage, just something that's been on my mind enough to warrant writing about it.

It feels like resiliency is becoming a bit of a lost art in networked software. I've spent a good chunk of this year chasing down intermittent failures at work, and I really underestimated how much work goes into shrinking the "blast radius", so to speak, of any bug or outage. Even though we mostly run a monolith, we still depend on a bunch of external pieces like daemons, databases, Redis, S3, monitoring, and third-party integrations, and we generally assumed that these things were present and working in most places, which wasn't always the case. My response was to better document the failure conditions, and once I did, I realized that there were many more than we initially thought. Since then we've done things like: move some things to a VPS instead of cloud services, automate deployment more than we already had, greatly improve the test suite and docs to include these newly considered failure conditions, and generally cut down on moving parts. It was a ton of effort, but the payoff has finally shown up: our records show fewer surprises, which means fewer distractions and a much calmer system overall. Without that unglamorous work, things would've only grown more fragile as complexity crept in. And I worry that, more broadly, we're slowly un-learning how to build systems that stay up even when the inevitable bug or failure shows up.
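To make the "blast radius" idea concrete, here is a minimal Python sketch of one common pattern: wrap a call to an external dependency so that a failure degrades to a fallback instead of taking the whole request down. This is only an illustration, not what we actually run; the helper name `call_with_fallback` and its parameters are made up for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 2,
    delay_seconds: float = 0.0,
) -> T:
    """Try `primary` up to `retries + 1` times, then return `fallback()`.

    Hypothetical helper: a real system would catch narrower exception
    types, log each failure, and probably add a timeout per attempt.
    """
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            # Pause between attempts, but not after the last one.
            if attempt < retries and delay_seconds > 0:
                time.sleep(delay_seconds)
    # The dependency is considered unavailable: degrade instead of failing.
    return fallback()


if __name__ == "__main__":
    def read_from_cache() -> str:
        # Stand-in for a cache lookup while the cache is unreachable.
        raise ConnectionError("cache unavailable")

    def read_from_database() -> str:
        # Slower path that does not depend on the cache.
        return "value loaded without the cache"

    print(call_with_fallback(read_from_cache, read_from_database))
```

The point is less the helper itself than the habit it forces: for every external piece, you have to write down what happens when it is missing.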

For completeness, here are the outages that prompted this: the AWS us-east-1 outage in October (took down the Lightspeed R series API), the Azure Front Door outage (prevented Playwright from downloading browsers for tests), today’s Cloudflare outage (took down Lightspeed’s website, which some of our clients rely on), and the GitHub outage affecting basically everyone who uses it as their git host.

3 comments

HardwareLust 218 days ago

It's money, of course. No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008, clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice and pray.

stinkbeetle 218 days ago

> It's money, of course.

100%

> No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008, clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice and pray.

Well, fly by night outfits will do that. Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.

Look at a big bank or a big corporation's accounting systems, they'll pay millions just for the hot standby mainframes or minicomputers that, for most of them, would never be required.

solid_fuel 218 days ago

> Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.

Used to, but it feels like there is no corporate responsibility in this country anymore. These monopolies have gotten so large that they don't feel any impact from these issues. Microsoft is huge and doesn't really have large competitors. Google and Apple aren't really competing in the source code hosting space in the same way GitHub is.

collingreen 218 days ago

> Take the number of vehicles in the field, A, multiply it by the probable rate of failure, B, then multiply it by the result of the average out of court settlement, C. A times B times C equals X. If X is less than the cost of a recall, we don't do one.

https://youtu.be/SiB8GVMNJkE
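For anyone who wants the quoted formula spelled out, a small Python sketch follows. The function names and every number in the example are invented purely for illustration; only the A times B times C rule comes from the quote.

```python
def expected_settlement_cost(
    vehicles: int, failure_rate: float, avg_settlement: float
) -> float:
    """X = A * B * C from the quote."""
    return vehicles * failure_rate * avg_settlement


def should_recall(
    vehicles: int, failure_rate: float, avg_settlement: float, recall_cost: float
) -> bool:
    """Per the quote: if X is less than the cost of a recall, no recall."""
    x = expected_settlement_cost(vehicles, failure_rate, avg_settlement)
    return x >= recall_cost


if __name__ == "__main__":
    # Entirely made-up inputs, chosen only to exercise both branches.
    a, b, c = 1_000_000, 0.0001, 500_000.0
    print(expected_settlement_cost(a, b, c))
    print(should_recall(a, b, c, recall_cost=80_000_000.0))
    print(should_recall(a, b, c, recall_cost=20_000_000.0))
```

Swap "vehicles" for "customers" and "recall" for "redundancy" and you get roughly the outage math being discussed upthread.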

csomar 218 days ago

> Look at a big bank or a big corporation's accounting systems

Not my experience. Every bank I've used, in multiple countries, has had multiple significant outages, including some where their cards stopped working. Do a search for "U.S. Bank outage" to see how many outages have happened so far this year.

closeparen 217 days ago

Modern internet company backends are very complex, even on a good day they're at the outer limits of their designers' and operators' understanding, & every day they're growing and changing (because of all the money and effort that's being spent on them!). It's often a short leap to a state that nobody thought of as a possibility or fully grasped the consequences of. It's not clear that it would be practical with any amount of money to test or rule out every such state in advance. Some exciting techniques are being developed in that area (Antithesis, formal verification, etc) but that stuff isn't standard of care for a working SWE yet. Unit tests and design reviews only get you so far.

Jenk 218 days ago

I've worked at many big banks and corporations. They are all held together with the proverbial sticky tape, bubblegum, and hope.

They do have multiple layers of redundancies, and thus have the big budgets, but they won't be kept hot, or there will be some critical flaws that all of the engineers know about but they haven't been given permission/funding to fix, and are so badly managed by the firm, they dgaf either and secretly want the thing to burn.

There will be sustained periods of downtime if their primary system blips.

They will all still be dependent on some hyper-critical system that nobody really knows how it works, the last change was introduced in 1988 and it (probably) requires a terminal emulator to operate.

stinkbeetle 218 days ago

I've worked on software used by these and have been called in to help support from time to time. One customer which is a top single digit public company by market cap (they may have been #1 at the time, a few years ago) had their SAP systems go down once every few days. This wasn't causing a real monetary problem for them because their hot standby took over.

They weren't using mainframes, just "big iron" servers, but each one would have been north of $5 million for the box alone, I guess on a 5ish year replacement schedule. Then there's all the networking, storage, licensing, support, and internal administration costs for it which would easily cost that much again.

Now people will say SAP systems are made entirely of duct tape and bubblegum. But it all worked. This system ran all their sales/purchasing sites and portals and was doing a million dollars every couple of minutes so that all paid for itself many times over during the course of that bug. Cold standby would not have cut it. Especially since these big systems take many minutes to boot and HANA takes even longer to load from storage.
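A rough back-of-the-envelope version of that in Python, using only the ballpark figures above. The assumptions are loud ones: "a couple of minutes" is taken as exactly 2, the box as exactly $5 million, "that much again" as another $5 million, and revenue during downtime is treated as lost rather than deferred. The names are made up for the sketch.

```python
# Ballpark inputs taken from the comment above, rounded as stated in the lead-in.
REVENUE_PER_MINUTE = 1_000_000 / 2      # "a million dollars every couple of minutes"
HOT_STANDBY_COST = 5_000_000 + 5_000_000  # the box, plus roughly that much again


def lost_revenue(downtime_minutes: float, revenue_per_minute: float) -> float:
    """Revenue not taken while the system is down (assumes none is recovered later)."""
    return downtime_minutes * revenue_per_minute


def break_even_minutes(standby_cost: float, revenue_per_minute: float) -> float:
    """Minutes of avoided downtime needed to cover the cost of the standby."""
    return standby_cost / revenue_per_minute


if __name__ == "__main__":
    minutes = break_even_minutes(HOT_STANDBY_COST, REVENUE_PER_MINUTE)
    print(f"Break-even after {minutes:.0f} minutes of avoided downtime")
```

Under those assumptions the standby covers its cost after about 20 minutes of avoided downtime in total, which is why a cold standby that needs many minutes per failover does not compare well when failures happen every few days.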

lopatin 218 days ago

I agree that it's all money.

That's why it's always DNS right?

> No one wants to pay for resilience/redundancy

These companies do take it seriously, on the software side, but when it comes to configurations, what are you going to do:

Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk. It looks like even the most critical and prestigious companies in the world are doing the former.

macintux 218 days ago

> Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk.

There's also the problem that doubling your cloud footprint to reduce the risk of a single point of failure introduces new risks: more configuration to break, new modes of failure when both infrastructures are accidentally live and processing traffic, etc.

Back when companies typically ran their own datacenters (or otherwise heavily relied on physical devices), I was very skeptical about redundant switches, fearing the redundant hardware would cause more problems than it solved.

paulddraper 218 days ago

Complexity breeds bugs.

Which is why the “art” of engineering is reducing complexity while retaining functionality.

1718627440 217 days ago

I'm not sure it's only money. People could have a lot of simpler, cheaper software by relying on core (OS) features instead of rolling their own or relying on bloated third parties, but a lot don't due to cargo culting.

raxxorraxor 217 days ago

And tech hype. Infrastructure to mitigate here isn't expensive. In many cases quite the opposite. The expensive thing is that you made yourself dependent on these services. Sometimes this is inevitable, but to host on GitHub is a choice.

Wowfunhappy 217 days ago

…can I make the case that this might be reasonable? If you’re not running a hospital†, how much is too much to avoid a few hours of downtime around once a year?

† Hopefully there aren’t any hospitals that depend on GitHub being continuously available?

ForHackernews 218 days ago

Why should they? Honestly most of what we do simply does not matter that much. 99.9% uptime is fine in 99.999% of cases.
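For a sense of scale, here is a tiny Python sketch that turns an availability percentage into a yearly downtime budget. The function name is made up for the example, and it assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(
    availability_percent: float, period_hours: float = HOURS_PER_YEAR
) -> float:
    """Hours of downtime permitted per period at the given availability."""
    return period_hours * (1 - availability_percent / 100)


if __name__ == "__main__":
    for percent in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(percent)
        print(f"{percent}% uptime allows about {hours:.2f} hours of downtime per year")
```

99.9% works out to roughly 8.76 hours a year, so a few hours of outage once a year still fits inside it.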

porridgeraisin 218 days ago

This is true. But unfortunately the exact same process is used even for critical stuff (the crowdstrike thing for example). Maybe there needs to be a separate swe process for those things as well, just like there is for aviation. This means not using the same dev tooling, which is a lot of effort.

roxolotl 218 days ago

To agree with the comments, it seems likely that it's money that has begun to result in a slow "un-learning how to build systems that stay up even when the inevitable bug or failure shows up."

suddenlybananas 218 days ago

To be deliberately provocative, LLMs are being more and more widely used.

zdragnar 218 days ago

Word on the street is GitHub was already a giant mess before the rise of LLMs, and it has not improved with the move to MS.

dsagent 218 days ago

They are also in the process of moving most of the infra from on-prem to Azure. I'm sure we'll see more issues over the next couple of months.

https://thenewstack.io/github-will-prioritize-migrating-to-a...

array_key_first 217 days ago

I don't know anything about GitHub's codebase, but as a user, their software has many obvious deficiencies. The most glaring being performance. Oh my God, GitHub performs like absolute shit on large repos and big diffs.

Performance issues always scare me. A lot of the time it's indicative of fragile systems. Like with a lot of banking software - the performance is often bad because the software relies on 10 APIs to perform simple tasks.

I doubt this is the case with GitHub, but it still makes you wonder about their code and processes. Especially when it's been a problem for many years, with virtually no improvement.

Tadpole9181 218 days ago

To be deliberately provocative, so is offshoring work.

blibble 218 days ago

imagine what it'll be like in 10 years time

Microsoft: the film Idiocracy was not supposed to be a manual