HN Mirror

by haroldl 1464 days ago

This was really interesting both in exploring the architecture of a retail system and looking at how systems fail. Better to read about it and learn than to live it.

I'd call it a 4-hour outage, because the initial "recovery" was just the result of cashiers manually typing in prices for items. Then, when load decreased and they discovered that scanning items worked again, the problem came right back.

Maybe returning 404 for both a cache miss and a "there's no endpoint at this path" error is an issue too. For other status codes there's a distinction between temporary and permanent conditions, e.g. 301 (permanent redirect) versus 302 (temporary redirect). It would've been good to use HTTP 400 Bad Request for the misconfigured URL and 404 for a cache miss.
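To make the 400-versus-404 idea concrete, here is a rough sketch. The `/items/` prefix and the `lookup` function are made up for illustration; the thread doesn't describe the real cache service's URL layout.

```python
from http import HTTPStatus
from typing import Optional, Tuple

# Hypothetical route prefix, assumed only for this sketch.
ITEM_PREFIX = "/items/"


def lookup(path: str, cache: dict) -> Tuple[HTTPStatus, Optional[dict]]:
    """Return a distinct status for 'bad URL' versus 'item not cached'."""
    # A path that doesn't match the known route is a caller/config error.
    if not path.startswith(ITEM_PREFIX) or len(path) == len(ITEM_PREFIX):
        return HTTPStatus.BAD_REQUEST, None
    item = cache.get(path[len(ITEM_PREFIX):])
    # A well-formed path whose item is simply absent is an ordinary miss.
    if item is None:
        return HTTPStatus.NOT_FOUND, None
    return HTTPStatus.OK, item
```

With this split, a spike in 400s points at misconfiguration, while 404s stay an expected part of normal cache traffic.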

In the 10% of stores with the early roll out of the config change the cache hit rate went to 0 right away, and that started 12 days before the outage. Alerts on cache hit rates and per-store alerts would've caught that.
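A per-store hit-rate alert could be as simple as the sketch below. The thresholds, the `stats` shape and the function name are assumptions for illustration, not anything from the write-up.

```python
def stores_to_alert(stats, min_hit_rate=0.5, min_requests=100):
    """Return store ids whose cache hit rate is below min_hit_rate.

    stats maps store id -> (hits, misses) for some time window.
    Stores with fewer than min_requests lookups are skipped to avoid noise.
    """
    alerts = []
    for store, (hits, misses) in sorted(stats.items()):
        total = hits + misses
        if total >= min_requests and hits / total < min_hit_rate:
            alerts.append(store)
    return alerts
```

A store whose hit rate drops to 0 right after a config rollout would show up in the first window that has enough traffic.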

Then there were 4 days where traffic to the main inventory micro-service in the data center jumped 3x which took it to what appears to be 80% of capacity. Load testing to know your capacity limits and alerts when you near that limit would've called out the danger.

Then, during the outage, when services slowed down due to too many requests, they were taken out of rotation for failing health checks. Applying back pressure/load shedding could have kept those servers in active use so that the system could keep up.
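For what load shedding might look like, here's a minimal sketch: cap the number of in-flight requests and reject the excess quickly with a 503 rather than letting everything slow down. The class and function names are hypothetical, and the right limit would have to come from load testing.

```python
import threading


class LoadShedder:
    """Caps concurrent work; callers over the cap are rejected immediately."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1


def handle(shedder, work):
    """Run work() if there is capacity, else shed the request with a 503."""
    if not shedder.try_acquire():
        return 503, None
    try:
        return 200, work()
    finally:
        shedder.release()
```

The idea is that a server answering some requests fast and refusing the rest can still pass its health check, instead of being pulled from rotation and pushing its load onto the remaining servers.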

2 comments

jabart 1464 days ago

204 No Content is an underused HTTP status. 404 should be monitored as an error, 204 as, well, no content available. If a status code has two responsibilities, that is a monitoring issue waiting to happen.

magicalhippo 1464 days ago

So your suggestion is that if I have a /invoice endpoint, a "GET /invoice/abc123" should return 204 if it's an invalid/non-existing invoice number?

Seems reasonable.

bombcar 1464 days ago

It seems absolutely insane to me that a system was designed and developed that allows taking down all registers in the country at once. I would have thought it would be designed to be much more "batch" oriented, so that the worst that could happen is you lose price updates and sales info until the batches can get through again.

jaywalk 1464 days ago

If the local system doesn't have the item data (or thinks it doesn't have the item data, because it's looking in the wrong place), where exactly is it supposed to get the item data from if not the central system?

bombcar 1464 days ago

The total size of all item data for a store like Target can't be much more than what, a few gigabytes? Or at least the "UPC -> Price" dataset. So download the whole dataset each night, if you can't get delta changes to work.

And if that had failed somehow, it would have been noticed immediately upon the new code roll-out.

The internet was designed to be extremely resilient to host/route losses, but we've made it so reliable that we assume all machines are reachable at all times.

(To be fair, apparently they "do" have this but the dataset is printed on the items and the cashiers had to enter it by hand)
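A rough sketch of the nightly full-download idea from this comment: load the whole "UPC -> Price" snapshot into a local table and only swap it in if it looks sane. The CSV column names (`upc`, `price_cents`) and the class are invented for illustration.

```python
import csv
import io


def load_snapshot(csv_text):
    """Parse a full price snapshot into a {upc: price_cents} dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["upc"]: int(row["price_cents"]) for row in reader}


class PriceTable:
    """Local price lookup that is replaced wholesale by each snapshot."""

    def __init__(self):
        self._prices = {}

    def refresh(self, csv_text):
        new_prices = load_snapshot(csv_text)
        # Refuse an empty snapshot so a bad download keeps yesterday's data.
        if not new_prices:
            raise ValueError("empty snapshot; keeping previous table")
        self._prices = new_prices

    def price(self, upc):
        return self._prices.get(upc)
```

A failed or empty download raises at refresh time, which is the kind of failure you'd notice at rollout rather than at the register.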

jaywalk 1464 days ago

Their system was basically designed the way you're saying, with a fallback to grab the data from the central location if it's missing locally. What you're asking for is the same system without a fallback, which doesn't make any sense.

kevan 1464 days ago

It's counterintuitive, but when you're dealing with distributed systems, lots of things are: https://aws.amazon.com/builders-library/avoiding-fallback-in...

bombcar 1464 days ago

Exactly - they had a fallback system that worked well enough for the testing to pass, but not well enough for the main system to operate on it.

Interestingly enough the Amazon example there is basically exactly what happened to Target.

bombcar 1464 days ago

The fallback was the problem - design it without it, or with a manual window that pops up saying "ITEM NOT FOUND, QUERY TARGET ORACLE" or something, and the fault wouldn't have taken down the whole company.

If suddenly every cashier is being forced to hit OK on every item, people would hear about it immediately from the test rollout instead of when it hit everything (of course, assuming you have good methods for detecting things like this and don't just completely ignore associates' complaints).
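Something like this is what I mean by designing it without the automatic fallback. It's only a sketch; the names and the per-store counter are hypothetical.

```python
from collections import Counter

# Count of manual prompts per store, so a rollout that breaks local
# lookups shows up in metrics as well as in cashier complaints.
manual_prompts = Counter()


def ring_up(upc, local_prices, store_id):
    """Look up a price locally; never silently call the central system."""
    price = local_prices.get(upc)
    if price is not None:
        return ("scanned", price)
    manual_prompts[store_id] += 1
    return ("manual_entry_required", None)
```

A miss costs one cashier one prompt, and the counter for the test stores would climb right away instead of the load quietly moving to the data center.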
