Hacker News
by DarenWatson 44 days ago
A couple of years ago, we experienced a silent data corruption incident in our checkout process due to this specific edge case.

A user would generate the idempotency key by loading the front-end application, add item(s) to their cart, and submit the order, but the request would time out. The user would then navigate back to the front-end application, add another item, and submit the order again. Because the second submission carried the same idempotency key as the first, our payment gateway would look up the request by idempotency key, find a successful (200 OK) response to the previous request in its cache, and return it. The user now believes they purchased three items; however, our system only charged for and shipped two.

Consequently, the lesson we take away from the aforementioned incident is idempotency keys are really composite keys (Client_Provided_Key + Hash(Request_Payload)).
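A minimal sketch of what such a composite key could look like. The function names and the choice of canonical JSON plus SHA-256 are assumptions for illustration, not something from our production code:

```python
import hashlib
import json

def payload_hash(payload: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so that logically
    # equal payloads produce the same hash regardless of key order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def composite_key(client_key: str, payload: dict) -> tuple[str, str]:
    # Hypothetical helper: pairs the client-provided key with the payload hash.
    return (client_key, payload_hash(payload))
```

With this, a cart resubmitted under the same client key but with an extra item yields a different composite key, so the mismatch can be detected.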

If a system receives an idempotency key it has already seen, but with a different request payload, it should reject the request with a 409 Conflict response and a message similar to "Idempotency key already used with different request payload". Alternatively, some teams argue for a 400 Bad Request response. Either way, systems should never silently return the cached response or overwrite the old entry.
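Roughly, the rejection check could look like the sketch below. `StoredEntry`, `check_replay`, and the plain dict used as the store are hypothetical names for illustration only:

```python
from dataclasses import dataclass

@dataclass
class StoredEntry:
    payload_hash: str      # hash of the payload first seen with this key
    response_status: int   # status of the cached response
    response_body: str     # body of the cached response

def check_replay(store: dict, client_key: str, incoming_hash: str):
    """Return a (status, body) tuple to send, or None if the key is new."""
    entry = store.get(client_key)
    if entry is None:
        return None  # first use of this key: the caller processes the request
    if entry.payload_hash != incoming_hash:
        # Same key, different payload: reject; never replay or overwrite.
        return (409, "Idempotency key already used with different request payload")
    # Same key, same payload: a genuine retry, so replay the cached response.
    return (entry.response_status, entry.response_body)
```

Teams that prefer 400 Bad Request would only change the status code in the mismatch branch.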

The other issue is locking. The cached response for an idempotency key will not exist until the first request completes, yet a retry with the same key can arrive while that request is still in progress.

To do this safely, follow these steps:

1. Acquire a distributed lock on the idempotency key.

2. Check for the existence of a key in your persistent store.

3. If an existing key is found, compare the hash of the incoming payload against the payload hash stored with the key. If the hashes do not match, return a 409 Conflict.

4. If the hashes match, look up the status of the request. If the status shows COMPLETED in the persistent store, return the cached response. If the status shows PENDING, return a 429 Too Many Requests to the user or hold the connection open until the request reaches the COMPLETED state.

5. After processing the request, save the response to the persistent store before releasing the lock.

While this may look simple on paper, creating a distributed locking state machine for a single API endpoint is typically how developers have their first aha moments with idempotency. Becoming idempotent is often an enormous architectural shift and not just a middleware header check.

5 comments

This is incredibly poor advice. The bug was clearly in the client code, which did not understand the purpose or usage of the idempotency key. The API itself probably had a design flaw as well - it sounds like it needed a session or transaction id to serve the purpose you mentioned. That is not what an idempotency key is for.

An API should follow its documented behavior. This is both a specification and a contract. If the docs for the API say that a duplicate idempotency key will receive a 409, and do not mention message hashes, then the API needs to follow that spec because the client may specifically depend on it. For example, if the order was processed and the cart is resent with the same key but an additional item, the client does not want another order that duplicates the items in the first one. They want an error.

If the docs do not accurately describe the behavior of the idempotency key, the client should find another provider.

> While this may look simple on paper, creating a distributed locking state machine for a single API endpoint is typically how developers have their first aha moments with idempotency. Becoming idempotent is often an enormous architectural shift and not just a middleware header check.

Yes, when you expand the scope of your API implementation beyond its contract you take on a virtually unbounded number of edge cases that not only must you solve, but that your customers must guess at how you are solving.

I'm guessing that your API required the idempotency key. I think that could be risky because it means developers will simply provide a value for it without understanding the purpose or thinking through the implications. You really only want them using it if they understand the problem it is solving.

Hashing message content could be an alternative behavior that it makes sense to support by default for apps that don't supply an idempotency key. As long as you document it.

Sounds like an interesting case of incorrectly trusting user input.

The idempotency key should have been viewed as the untrustworthy hint it really is. Then you can decide whether an untrustworthy hint is what you really need. At that point I'd hope someone on the team says "This is ordering - I think we need something trustworthy"

> Consequently, the lesson we take away from the aforementioned incident is idempotency keys are really composite keys (Client_Provided_Key + Hash(Request_Payload)).

Did the postmortem result in any other (wider) changes/actions, out of curiosity?

No idea if this was anything like what happened in your case, and probably going off on a tangent, but I've seen so many cases where teams are split into backend and frontend, and they stop thinking about the product as a single distributed system (or the split exacerbates a lack of that thinking from before). Frontend often suggests "Oh we can just create an idempotency key" and any concerns from backend are dismissed. If they implement it incorrectly, backend is on the wrong 'team' to provide input.

I wholeheartedly agree. Luckily it is a lot easier to reliably run a distributed KV store that only needs to lock the idempotency key for relatively short periods than to lock a whole database with millions of records or to make arbitrary systems idempotent.

> 5. After processing the request, save the response to the persistent store before releasing the lock.

Save only if the operation succeeds. It's meaningless to cache a failure: subsequent retries will just get the failure back from the cache.

Frankly you guys are overengineering the whole thing. We use the concept only for network outages, i.e. it is only on timeout that we want to guard against fulfilling duplicate requests for the same operation.

>Client_Provided_Key + Hash(Request_Payload)

Congrats on destroying the purpose of Idempotency Keys.

Ask yourself, why not just `Hash(Request_Payload)`? That'll give you half of what you need to know about why the Idempotency Key header is useful in the first place.
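To make that half concrete, here is a small sketch with a made-up order payload and hypothetical names. Two deliberate, identical purchases hash the same, so a payload hash alone cannot tell a retry from a genuine second order:

```python
import hashlib
import json
import uuid

def payload_hash(payload):
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

order = {"sku": "coffee-1kg", "qty": 1}  # made-up example payload

# Two deliberate purchases of the same thing: identical payloads, identical hashes.
first_hash, second_hash = payload_hash(order), payload_hash(order)

# A client-generated key per purchase attempt tells them apart;
# a retry of one attempt would reuse that attempt's key.
first_key, second_key = str(uuid.uuid4()), str(uuid.uuid4())
```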

The other half you already know? You just described your bug, it's a bug, on your front-end, this has nothing to do with idempotency; if anything, the system is performing as expected.

If your requests do something different, they should have different Idempotency Keys. <- this brings down TFA and most of the comments here. I guess those are the perils of vibecoding.