by jkaptur 1458 days ago
It's very interesting that by building a system that's more resilient and reliable:

> high profile processes (such as POS) implement their own fallback processes to handle the possibility of issues with the SDM system in store. In the case of item data, the POS software on each register is capable of bypassing the SDM Proxy and retrying its request directly to the ILS API in the data centers.

... the system as a whole became much more complex and difficult to observe. The system was running in a degraded, abnormal, less-tested, fallback mode for days without anyone caring.

This is also a point about the normalization of deviance. When there is a background rate of the POS using the fallback path, who is to say how important an increase in that rate might be?
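To make that concrete, here is a minimal sketch of what "watching the fallback rate" could look like. Everything in it is hypothetical and not taken from the write-up: `fetch_via_proxy` and `fetch_direct` stand in for the SDM Proxy and direct ILS API calls, and the window size and 5% threshold are made-up numbers.

```python
from collections import deque


class FallbackMonitor:
    """Tracks the fraction of recent requests that used the fallback path."""

    def __init__(self, window=1000, alert_rate=0.05):
        # Sliding window of outcomes; True means the fallback path was used.
        self.outcomes = deque(maxlen=window)
        # Hypothetical threshold above which the rate counts as abnormal.
        self.alert_rate = alert_rate

    def record(self, used_fallback):
        self.outcomes.append(used_fallback)

    def rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def is_abnormal(self):
        return self.rate() > self.alert_rate


def lookup_item(item_id, fetch_via_proxy, fetch_direct, monitor):
    """Try the primary path; on failure, record it and use the fallback."""
    try:
        result = fetch_via_proxy(item_id)
        monitor.record(False)
        return result
    except Exception:
        monitor.record(True)
        return fetch_direct(item_id)
```

The point of the sketch is only that the fallback is counted every time it is taken, so "the registers have been on the fallback path for days" shows up as a number somebody can alert on, instead of being indistinguishable from the normal background rate.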

2 comments

Buried under another thread was this post: https://aws.amazon.com/builders-library/avoiding-fallback-in... which describes the exact same issue: a cache miss was backed up by a direct query, and it took down all of Amazon trying to display shipping times.
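A rough back-of-the-envelope sketch of why a cache-miss fallback to a direct query is dangerous. The traffic and hit-rate numbers are invented for illustration and are not from the linked post.

```python
def backend_qps(total_qps, cache_hit_rate):
    """Queries per second that reach the backend when misses fall through."""
    return total_qps * (1.0 - cache_hit_rate)


def amplification(cache_hit_rate):
    """How many times backend load grows if the cache stops answering."""
    miss_rate = 1.0 - cache_hit_rate
    if miss_rate == 0.0:
        # Backend normally sees no traffic, so any load is unbounded growth.
        return float("inf")
    return 1.0 / miss_rate


# Hypothetical numbers: 10,000 requests/s with a 99% cache hit rate.
normal_load = backend_qps(10_000, 0.99)   # about 100 queries/s
outage_load = backend_qps(10_000, 0.0)    # all 10,000 queries/s
growth = amplification(0.99)              # about 100x
```

With those assumed numbers, the backend is sized and tested for about 100 queries per second, and the fallback hands it the full 10,000 at exactly the moment something else is already broken.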

grok say complexity bad

Fallback is not always necessary (sometimes it is, you can't just say "welp, the engines on this plane went out, time to die"), but when you have a fallback system you should think about why you have it, how bad it is for it to fail, and whether it could be worse for it to succeed.

I find all of this fascinating.

A few years ago, the guys who built Chick-fil-A's POS fog were on HN talking about their fault-tolerance and transaction queueing. It was quite interesting.

There's a lot that you can learn from high-volume POS system design that applies to just bog-standard everyday programming.