by pavas 1008 days ago
My team's systems play a critical role in several hundred million dollars of sales per day, such that if our systems go down for long enough, these sales will be lost. "Long enough" means at least several hours, and within that time frame we can usually get things back to a good state, often without much external impact.

We too have manual processes in place, but for any manual process we document the rollback steps (before starting) and monitor the deployment. We also separate deployment of code from deployment of features (which is done gradually behind feature flags). We insist that any new feature (or modification of code) requires a new feature flag; while this is painful and slow, it has helped us avoid risky situations and panic, and has alleviated our ops and on-call burden considerably.
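To make the "gradually behind feature flags" part concrete, here is a minimal sketch of a percentage rollout. It is not our actual flag system; the names `in_rollout`, `checkout` and `new_checkout_flow` are made up for illustration, and it assumes a stable user ID is available to bucket on.

```python
import hashlib


def in_rollout(flag_name: str, user_id: str, percent: float) -> bool:
    """Return True if this user falls inside the rollout percentage for the flag.

    Hashing flag name + user ID gives each user a stable bucket per flag, so
    raising the percentage only ever adds users and never flips existing ones
    back to the old path.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100  # percent is given on a 0..100 scale


def checkout(user_id: str, rollout_percent: float) -> str:
    # The new code ships dark; only the flag decides who actually runs it.
    # "new_checkout_flow" is a hypothetical flag name.
    if in_rollout("new_checkout_flow", user_id, rollout_percent):
        return "new"
    return "old"
```

With this shape, deploying the code and turning the feature on are two separate changes, and "rollback" of the feature is just setting the percentage back to 0.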

For something to go horribly wrong, it would have to get through many "filters" of defects: 1. Code review: a behavioral change is accidentally introduced without a feature flag (this can happen, e.g. when updating dependencies). 2. Manual and devo testing (which is hit or miss) misses it. 3. Something in our deployment fails (luckily this is mostly automated, though as with all distributed systems there are edge cases). 4. Rollback fails or is done incorrectly. 5. Monitoring is missing, so nothing alerts us that the issue still hasn't been fixed. 6. We fail to escalate the issue to higher levels in time. 7. Enough time passes that we can no longer meet our SLA, etc.
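The reasoning behind stacking filters can be sketched with a bit of arithmetic. This assumes each filter misses a defect independently of the others, which is optimistic (real failures tend to be correlated), and the miss rates below are invented purely for illustration, not measurements from our systems.

```python
from math import prod


def p_all_fail(miss_rates):
    """Probability that a defect slips past every filter.

    Assumes the filters fail independently, so the individual miss
    probabilities simply multiply.
    """
    for p in miss_rates:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return prod(miss_rates)


# Hypothetical miss rates, one per filter (review, testing, deployment,
# rollback, monitoring, escalation, time to SLA). Illustrative only.
example_miss_rates = [0.1, 0.5, 0.05, 0.1, 0.1, 0.1, 0.5]

with_all_filters = p_all_fail(example_miss_rates)
without_first_filter = p_all_fail(example_miss_rates[1:])
```

Even with mediocre individual filters the product gets small quickly, and dropping any one filter makes it larger by exactly that filter's factor.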

For any riskier manual changes we can also require two people to make the change (one points out what's being changed over a video call, the other verifies).

If you're dealing with a system where your SLA is in minutes, and changes are irreversible, you need to know how to practically monitor and roll back within minutes, and if you're doing something new and manually, you need to quadruple check everything and have someone else watching you make the change, or it's only a matter of time before enough things go wrong in a row and you can't fix it. It doesn't matter how good or smart you are, mistakes will always happen when people have to manually make or initiate a change, and that chance of making mistakes needs to be built into your change management process.
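One way to build that into the process is a pre-flight check that refuses to start a manual change until the basics are in place. This is only a sketch under my own assumptions: `ManualChange` and `blocking_issues` are hypothetical names, and a real process would likely live in a ticketing or change management tool rather than in code like this.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ManualChange:
    description: str
    operator: str                                   # person making the change
    rollback_steps: List[str] = field(default_factory=list)
    verifier: Optional[str] = None                  # second person watching
    risky: bool = False


def blocking_issues(change: ManualChange) -> List[str]:
    """Return the reasons this change must not start yet (empty list = go)."""
    issues = []
    if not change.rollback_steps:
        # Rollback steps are written down before starting, not during an outage.
        issues.append("rollback steps are not documented")
    if change.risky and not change.verifier:
        issues.append("risky change has no second person verifying")
    if change.verifier and change.verifier == change.operator:
        issues.append("verifier must be someone other than the operator")
    return issues
```

The check does not prevent mistakes; it only makes sure the rollback plan and the second pair of eyes exist before the first command is run.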


>My team's systems play a critical role for several $100M of sales per day, such that if our systems go down for long enough, these sales will be lost.

Would they? Or would they just happen later? In a lot of cases in regular commerce, or even B2B, the same sales can often be attempted again by the client a little later; it's not "now or never". As a user I have retried things I wanted to buy when a vendor was down (usually because of a new announcement and big demand breaking their servers) or when my bank had some maintenance issue, and so on.

It's both (though I would lean towards lost for a majority of them). It's also true that the longer the outage, the greater the impact, and you have to take into account knock-on effects such as loss of customer trust. Since these are elastic customer-goods, and ours isn't the only marketplace, customers have choice. Customers will typically compare price, then speed.

It's also probably true that a one-day outage would have a negative net present value (taking into account all future sales) far exceeding the daily loss in sales, due to loss of customer goodwill.

It would be a serious issue for in-person transactions like shops, supermarkets, gas stations, etc.

Imagine Walmart's, Costco's, or Chevron's centralised payment services went down for 30+ mins. You would get a lot of lost sales from those who don’t carry enough cash to cover it otherwise. Maybe a retailer might have a zapzap machine but lots of cards aren’t imprinted these days so that’s a non-starter too.

Not just lost sales. I've seen a Walmart lose all ability to do credit card sales and after about 5 minutes maybe 10% of people waiting just started leaving with their groceries in their cart and a middle finger raised to the security telling them to stop.
That's some low-class rogue behavior though, not the standard in sales ("they can't process my card, let me take the stuff for free anyway").
> Maybe a retailer might have a zapzap machine but lots of cards aren’t imprinted these days so that’s a non starter too.

When I Google "zapzap machine" this comment is the only result, but after looking around on Wikipedia, I see this is a typo for "zipzap".

Is this really the only time in history someone has typoed zipzap as zapzap? I guess so.

For anyone who is still confused: https://en.wikipedia.org/wiki/Credit_card_imprinter
Haha yeah I guess so! Last time I used one was in the previous millennium.
It depends on the business. It's not uncommon for clients to execute against different institutions' systems, and they can/would re-route flow to someone else if you're down.

Think less "buying a car" and more "buying a pint of milk". If you're buying a car and the store is closed, you might come back the next day. If you're buying milk you will just go to the store down the street.

I imagine it's the same with time-based or opportunistic businesses. If the shopping channel (assuming it runs around the clock) couldn't process orders, they'd have to decide if they want to forgo selling other products to rerun the missed ones.

For certain types of entertainment like movies or sports, the sale may no longer be relevant.