by felixhuttmann 1811 days ago
Something I have always wondered about: the stripe docs say "Stripe's idempotency works by saving the resulting status code and body of the first request made for any given idempotency key, regardless of whether it succeeded or failed. Subsequent requests with the same key return the same result, including 500 errors." This indicates that the idempotency functionality is created in a kind of layer around the main application functionality, and thus the request from the idempotency layer to the main app is not itself idempotent. So the idempotency key protects against faults on the network between the client and stripe's idempotency layer, but not against stripe-internal faults between the idempotency layer and the application. Is that the case? Why is the idempotency not achieved by using an idempotent database transaction or atomic operation? What is the value of responding repeatedly with a 500 if the original 500 was caused by a transient error?

Adyen seems to similarly implement the idempotency in a separate data store (https://docs.adyen.com/development-resources/api-idempotency...).

Both stripe's and adyen's implementation seem to not treat transient faults within their own systems correctly.

4 comments

Many developers conflate idempotency and purity. It appears stripe is trying for purity: two requests always have the same response. This is likely misguided.

Idempotency would cover a case like this: the API failed to generate the success response, but the backend did issue the money movement. On subsequent retries what matters is that it doesn’t do a duplicate money movement. The client ought to get a response indicating the money was successfully moved (since it was), perhaps with the earlier timestamp as a way of indicating it was already moved. Or it ought to get a special response that indicates the client can stop retrying.

Sending the same error upon success sounds like the client has no way to know when to stop retrying. Sure it makes the API pure, but what problem does it even solve?
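A minimal sketch of the behaviour described in this comment, assuming a hypothetical in-memory ledger (the `Ledger` class, `move_money`, and the `already_processed` flag are made up for illustration and are not any real payment API). A retry with the same key never moves money twice, and it reports the original success along with the original timestamp:

```python
import time
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Hypothetical in-memory ledger; all names here are illustrative."""
    balance: int = 0
    # Maps idempotency key -> record of the movement that was already performed.
    done: dict = field(default_factory=dict)

    def move_money(self, key: str, amount: int) -> dict:
        if key in self.done:
            # The side effect already happened: report success, do not repeat it.
            return {**self.done[key], "already_processed": True}
        self.balance += amount
        record = {"status": "succeeded", "amount": amount, "moved_at": time.time()}
        self.done[key] = record
        return {**record, "already_processed": False}


ledger = Ledger()
first = ledger.move_money("key-1", 500)
retry = ledger.move_money("key-1", 500)  # same key: no second movement
```

Here the retry carries the earlier `moved_at` value, so the client can tell that the money moved once and that it can stop retrying.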

For Stripe, there's two headers that can help clients with retrying or not: "Idempotent-Replayed: true"[1] and "Stripe-Should-Retry"[2].

Whether they're helpful in all server-side error situations I'm not sure.

1: https://stripe.com/docs/idempotency#sending-idempotency-keys

2: https://stripe.com/docs/error-handling#the-stripe-should-ret...
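As a rough illustration of how a client might use these two headers, here is a small decision function. The header names come from the comment above; the function name `should_retry` and the fallback policy when no hint is present (retry on 5xx unless the response was a replay) are assumptions for the sketch, not documented Stripe behaviour:

```python
def should_retry(status: int, headers: dict) -> bool:
    """Decide whether to resend a request with the same idempotency key."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {k.lower(): str(v).strip().lower() for k, v in headers.items()}

    hint = h.get("stripe-should-retry")
    if hint == "true":
        return True
    if hint == "false":
        return False

    # Assumed fallback: a replayed response is the stored result of the first
    # attempt, so resending the same key would only return it again.
    if h.get("idempotent-replayed") == "true":
        return False
    return status >= 500
```

For example, `should_retry(500, {"Stripe-Should-Retry": "false"})` returns `False`, while a 503 with neither header falls through to the assumed default and returns `True`.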

"Should-Retry" is useful, but "Idempotent-Replayed" while might be somewhat useful for debugging, makes the idempotent semantics worse: i.e. the side-effect idempotency stays the same, while result isn't idempotent.

I guess in HTTP you may say that only the body of the HTTP response is the result, not the headers. But I implemented a similar system for GraphQL, and there are these boolean flags (isRetry and isReplayed, etc.) are part of the mutation inputs and/or payloads. The same with the "Idempotency-Key" itself, which is called clientMutationId in Relay mutation spec. The added value of not using HTTP headers is that it also works over websockets.

> So the idempotency key protects against faults on the network between the client and stripe's idempotency layer, but not against stripe-internal faults between the idempotency layer and the application. Is that the case?

Yes, in practice this is the case — the idempotency key insertion could succeed and then subsequent queries fail. Practically though, it doesn't happen very often — if a request gets as far as the idempotency layer successfully, the rest of it tends to work too.

Where it doesn't, Stripe pays a lot of attention to 500s, especially where those pertain to charges, and a lot of time and energy is spent cleaning up state that might have resulted in an invalid transaction.

> Why is the idempotency not achieved by using an idempotent database transaction or atomic operation?

The most honest answer is that Stripe wasn't built on a data store where transactions are supported, so it wasn't even a possibility until quite recently, and by then the system was already well established in its current form.

Beyond that though, once your requests are making their own requests to modify state in foreign systems (which is happening at Stripe in the form of banks, partners, internal systems, etc.), a single transaction isn't enough to keep things entirely in order anymore because it can't roll back that remote state. It is still possible to build a very robust system that is transaction-based, but it becomes a much more complex problem than a simple `BEGIN`/`COMMIT`.

> Practically though, it doesn't happen very often

It probably happens much more often than you think if you say this as someone with an insider perspective. I have had to integrate with banking systems and payment systems that use this approach, and it is extremely frustrating and comes off as a way of offloading work to the client. If a payment capture succeeds but subsequently always returns 500, an API client has to first query the status and then execute if-then logic for something that could easily just have been a retry of the request (3x the logic). This is acceptable since there are a million other ways to mess up idempotency, so integrations kind of end up this way anyway. But the worst part of such an approach is debugging the issues as a client. A client cannot see the internal logs and therefore has to call support, which will ALWAYS answer «the transaction seems fine with us» and basically just close the case to protect their KPIs. I'm pretty sure this type of issue has a name but I cannot find it (the state the client sees doesn't match the real state). I don't mean to offend the work people put into these APIs, but I cannot see the good qualities of this approach other than saving development hours (and possibly saving one db query, but as an effect you get a status request plus another capture request as a workaround from your clients).
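To make the "3x the logic" point concrete, here is a sketch of the client-side workaround being described: capture, then on a 5xx query the status, then branch. The callables `capture` and `get_state`, the state names, and the choice to use a fresh key for the second capture are all hypothetical and stand in for whatever a given payment API offers:

```python
import uuid
from typing import Callable


def capture_with_recovery(
    capture: Callable[[str, str], int],   # (payment_id, idempotency_key) -> HTTP status
    get_state: Callable[[str], str],      # payment_id -> "authorized" | "captured" | ...
    payment_id: str,
) -> str:
    # Step 1: the request that could have been a plain retry.
    status = capture(payment_id, str(uuid.uuid4()))
    if status < 400:
        return "captured"
    if status < 500:
        return "rejected"

    # Step 2: the stored 500 will be replayed for this key, so ask for the real state.
    state = get_state(payment_id)
    if state == "captured":
        return "captured"

    # Step 3: if-then logic; the capture never happened, so try again with a new key.
    if state == "authorized":
        status = capture(payment_id, str(uuid.uuid4()))
        return "captured" if status < 400 else "unknown"
    return "unknown"
```

Every client integrating with such an API ends up writing some variant of steps 2 and 3 itself.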

Just to clarify: I was speaking specifically about the case where you have a series of DB calls (like: auth user, retrieve account record, insert idempotency key, do more stuff), and the first one succeeds and the next ones fail. It can happen where the DB suddenly drops out as the request is executing, but it's more likely that it's either available or it isn't, so either the request succeeds, or it can't start.

And this comment was just meant to talk about faults with your own database. Once you are reaching out to other systems you see all kinds of problems regularly, but those tend to be handled more robustly because you kind of have to.

I might have misunderstood the scope of the idempotency layer, I thought it was this thin layer close to the web server that just replays the last answer from the services that are lower down. So that means stripe actually goes down to some store or external system to check the transaction status for subsequent calls with the same key?

I know how complicated integrating with stuff like the bank, 3Dsecure, AML and fraud prevention is. One of the systems I integrated with had no refund functionality, they expected a manually initiated bank transfer from our customer bank to the end user! So I certainly understand some states are irrecoverable in a web request. It is important to do though, because it saves all clients work for each automated recovery that can be done.

Thank you for the insightful answer.

I see the problem with mutations in foreign systems if those foreign systems do not support idempotency themselves. IMHO, though, stripe should abstract away faults in banks, and figure out how to work around faults in bank's systems using e.g. automated refunds when a duplicate charge is detected, and not just bubble up a 500 to stripe's customers and leave it to them to figure out. If stripe cannot figure out in an automated way toward the bank whether the request succeeded, stripe's api customers certainly can't either, and stripe should risk double-charging the end customer knowing that the end customer will complain and request a chargeback.

> The most honest answer is that Stripe wasn't built on a data store where transactions are supported

Transactions are not necessary if one can do an insert conditional on the key not yet existing, but then it is required to have the idempotency key from the client enter into the primary key.

> IMHO, though, stripe should abstract away faults in banks, and figure out how to work around faults in bank's systems using e.g. automated refunds when a duplicate charge is detected, and not just bubble up a 500 to stripe's customers and leave it to them to figure out.

Yeah, this is what Stripe tries to do. Most problems during calls to foreign systems are handled in ways that try hard not to send back an internal error. 500s are tracked carefully because they're painful to the user, but also because they leave behind potentially bad state that'll eventually cause problems internally and externally.

If they can be reconciled, a webhook will be fired to give the caller a more determinate answer (obviously less convenient for them to handle, but at least some sort of message makes its way back). More documentation on that here:

https://stripe.com/docs/error-handling#server-errors

> Transactions are not necessary if one can do an insert conditional on the key not yet existing, but then it is required to have the idempotency key from the client enter into the primary key.

This is how the implementation works more or less — an insert on a unique index that will error on a duplicate so you know it happened already. You'd probably implement it similarly in basically any major database whether Mongo, Postgres, MySQL, etc.
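A minimal sketch of that pattern using SQLite from the Python standard library. The table name, column names, and the `claim_key` helper are invented for illustration; the only point is that the uniqueness constraint makes "insert if the key does not exist yet" a single atomic step:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE idempotency_keys (
        account_id TEXT NOT NULL,
        idem_key   TEXT NOT NULL,
        PRIMARY KEY (account_id, idem_key)  -- the client's key is part of the primary key
    )
    """
)


def claim_key(conn: sqlite3.Connection, account_id: str, idem_key: str) -> bool:
    """Return True if this is the first time the key is seen, False on a duplicate."""
    try:
        with conn:  # commits on success, rolls back if the insert raises
            conn.execute(
                "INSERT INTO idempotency_keys (account_id, idem_key) VALUES (?, ?)",
                (account_id, idem_key),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint rejected the insert: the request happened already.
        return False
```

Calling `claim_key(conn, "acct_1", "k1")` twice returns `True` and then `False`; the same key under a different account is claimed independently.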

This is only a very small part of what transactions get you though: if a transaction-based request makes it midway through its lifecycle and then fails, it can roll back to a fresh slate. In a transaction-less system, you need to come up with some other answer for what to do with the partial state that was left behind.
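The "roll back to a fresh slate" behaviour can be sketched with SQLite. The schema and the `fail_midway` switch are hypothetical and exist only to simulate a fault between two writes; note that this covers local database state only, not state already changed in a remote system:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE charges (id TEXT PRIMARY KEY, amount INTEGER)")
conn.execute("CREATE TABLE ledger_entries (charge_id TEXT, delta INTEGER)")
conn.commit()


def create_charge(conn, charge_id, amount, fail_midway=False):
    # Both writes belong to one transaction: either both persist or neither does.
    with conn:
        conn.execute("INSERT INTO charges VALUES (?, ?)", (charge_id, amount))
        if fail_midway:
            raise RuntimeError("simulated fault between the two writes")
        conn.execute("INSERT INTO ledger_entries VALUES (?, ?)", (charge_id, amount))


try:
    create_charge(conn, "ch_1", 500, fail_midway=True)
except RuntimeError:
    pass
# The half-finished charge row was rolled back, so nothing is left behind.
rows_after_failure = conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]

# A plain retry now succeeds without tripping over leftover partial state.
create_charge(conn, "ch_1", 500)
rows_after_retry = conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
```

Without the transaction, the first attempt would leave a `charges` row with no matching ledger entry, and the retry would fail on the primary key.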

I don’t speak for stripe, but do have similar experience with idempotent APIs.

> the idempotency functionality is created in a kind of layer around the main application functionality …

Yes. Think of it as a kind of request/response handler cache that doesn’t need to be deeply tied to the operation logic. On the request path, insert an entry with a compound key for the “idempotency parameters” like client identity, operation, idempotency token, and request parameters. You can hash these if needed. On the response path, update your idempotency cache with the response value. If the underlying operation times out, etc., you can synthesize and insert the exception response. On subsequent requests, read the idempotency key(s) from your cache and return the cached response if it exists.
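The request/response cache described here could look roughly like the following single-threaded sketch. The class name, the dict-based cache, the response shape, and the 409 returned while the first request is still pending are all assumptions made for the example:

```python
import hashlib
import json


class IdempotencyLayer:
    """Hypothetical wrapper that caches the first response per compound key."""

    def __init__(self, handler):
        self.handler = handler  # the underlying operation: (operation, params) -> body
        self.cache = {}         # hashed compound key -> response, or None while pending

    @staticmethod
    def cache_key(client_id, operation, token, params) -> str:
        # Hash the "idempotency parameters" into one fixed-size key.
        raw = json.dumps([client_id, operation, token, params], sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def handle(self, client_id, operation, token, params) -> dict:
        key = self.cache_key(client_id, operation, token, params)
        if key in self.cache:
            cached = self.cache[key]
            if cached is None:
                # Assumed policy: the first request has not produced a response yet.
                return {"status": 409, "body": "request in progress"}
            return cached  # replay the stored response, success or failure
        self.cache[key] = None  # request path: record the key before doing work
        try:
            response = {"status": 200, "body": self.handler(operation, params)}
        except Exception as exc:
            # Synthesize an error response so retries see the same outcome.
            response = {"status": 500, "body": str(exc)}
        self.cache[key] = response  # response path: store the result
        return response


calls = []


def charge(operation, params):
    calls.append(params)
    return {"charged": params["amount"]}


layer = IdempotencyLayer(charge)
first = layer.handle("client_a", "charge", "tok-1", {"amount": 500})
replay = layer.handle("client_a", "charge", "tok-1", {"amount": 500})
```

The handler runs once and `replay` is the stored `first` response. The same mechanism replays a synthesized 500, which is exactly the behaviour the original question objects to.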

> So the idempotency key protects against faults on the network between the client and stripe's idempotency layer, but not against stripe-internal faults between the idempotency layer and the application.

Maybe. The external idempotency token is there for client retries. Internally there’s idempotency all the way down to somewhere you’re (hopefully) doing a compare and swap, conditional put, or similar. A subsequent request that conflicts should be detected based on that state.

> Why is the idempotency not achieved by using an idempotent database transaction or atomic operation?

Beyond small examples you probably can’t. You end up calling multiple other services or have distributed state of some sort for most interesting operations. And distributed transactions are expensive/difficult/painful. So plumbing idempotency and error handling workflows through is simpler.

these seem extreme for most apis but are likely very good at preventing double charging. a “normal” api call usually costs nothing. i bet stripe and adyen have apis with an average cost per call (to someone) being like $20-$30

the cost to undo a charge is high in people and reputation

How could their API possibly cost $20-$30 per call? How could that even be a business model? Clearly, I am missing something here.
I suspect the OP meant to say charge instead of cost.
ah that makes sense.
unrelated but fun fact: AWS CloudHSM v1 (deprecated now) had a $5,000 api call. that was the cost to create a cluster.
Don’t some bank transfer systems, like in the U.K., cost around £25 per transaction?