|
|
|
|
|
by justapassenger
1429 days ago
|
|
This. People working in IT naturally think keeping IT systems up 100% time is most important. And depending on the business it often is, but it all costs money. Running a business is about managing costs and risks. - Is it worth to spend 20% more on IT to keep our site up 99.99% vs 99%? - Is it worth to have 3 suppliers for every part that our business depends, with each of them being contracted to be able to supply 2x more, in case other supplier has issues? And pay a big premium for that? - Is it worth to have offices across the globe, fully staffed and trained to be able to take on any problem, in case there's big electrical outage/pandemic/etc in other part of the world? I'm not saying that some of those outages aren't results of clowny/incompetent design. But "site sometimes goes down" can be often a very valid option. |
|
To them it's much easier to communicate "Hey, our customer service is going to be degraded on the second saturday in october, call on monday" to their customers 1-2 month in advance, prepare to have the critical information without our system, and have agents tell people just that.
This has really started to change my thoughts of how to approach, e.g. a major postgres update. In our case, it's probably the better way to just take a backup, shutdown everything, do an offline upgrade and rebuild & restore if things unexpectedly go wrong. We can totally test the happy case to death, and if the happy case works, we're done in 2 hours for the largest systems with minimal risk. 4 hours if we have to recover from nothing, also tested.
And you know, at that point, is it really economical to spend weeks to plan and weeks to test a zero downtime upgrade that's hard to test, because of load on the cluster?