by brandur 1813 days ago
> So the idempotency key protects against faults on the network between the client and stripe's idempotency layer, but not against stripe-internal faults between the idempotency layer and the application. Is that the case?

Yes, in practice this is the case — the idempotency key insertion could succeed and then subsequent queries fail. Practically though, it doesn't happen very often — if a request gets as far as the idempotency layer successfully, the rest of it tends to work too.

Where it doesn't, Stripe pays a lot of attention to 500s, especially where those pertain to charges, and a lot of time and energy is spent cleaning up state that might have resulted in an invalid transaction.

> Why is the idempotency not achieved by using an idempotent database transaction or atomic operation?

The most honest answer is that Stripe wasn't built on a data store where transactions are supported, so it wasn't even a possibility until quite recently, and by then the system was already well established in its current form.

Beyond that though, once your requests are making their own requests to modify state in foreign systems (which is happening at Stripe in the form of banks, partners, internal systems, etc.), a single transaction isn't enough to keep things entirely in order anymore because it can't roll back that remote state. It is still possible to build a very robust system that is transaction-based, but it becomes a much more complex problem than a simple `BEGIN`/`COMMIT`.


> Practically though, it doesn't happen very often

It probably happens much more often than you think if you say this as someone with an insider perspective. I have had to integrate with banking systems and payment systems that use this approach, and it is extremely frustrating and comes off as a way of offloading work to the client. If a payment capture succeeds but subsequently always returns 500, an API client has to first query the status and then execute if-then logic for something that could easily just have been a retry of the request (3x the logic). This is acceptable since there are a million other ways to mess up idempotency, so integrations kind of end up this way anyway. But the worst part of such an approach is debugging the issues as a client. A client cannot see the internal logs and therefore has to call support, which will ALWAYS answer «the transaction seems fine with us» and basically just close the case to protect their KPIs. I'm pretty sure this type of issue has a name but I cannot find it (the state the client sees doesn't match the real state). I don't mean to offend the work people put into these APIs, but I cannot see the good qualities of this approach other than saving development hours (and possibly saving one DB query, but as an effect you get a status request plus another capture request as a workaround from your clients).
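To make the "query the status, then if-then" workaround concrete, here is a minimal sketch of what the client ends up writing. Everything in it is hypothetical: `FakeGateway`, `capture`, `status` and `capture_with_recovery` are made-up names for illustration, not Stripe's or any real provider's API, and the gateway is an in-memory stand-in that applies the capture but always answers 500.

```python
class FakeGateway:
    """Hypothetical payment API: the capture is applied, but the caller always gets a 500."""

    def __init__(self) -> None:
        self.captured: dict[str, bool] = {}

    def capture(self, payment_id: str, idem_key: str) -> int:
        self.captured[payment_id] = True  # the state changes server-side...
        return 500                        # ...but the client only sees an error

    def status(self, payment_id: str) -> str:
        return "captured" if self.captured.get(payment_id) else "pending"


def capture_with_recovery(gw: FakeGateway, payment_id: str, idem_key: str,
                          max_attempts: int = 3) -> bool:
    """Retry a capture, checking the real status after every failed attempt."""
    for _ in range(max_attempts):
        if gw.capture(payment_id, idem_key) == 200:
            return True
        # After an error the client cannot tell whether the capture happened,
        # so it has to ask before deciding whether to retry.
        if gw.status(payment_id) == "captured":
            return True
    return False


gw = FakeGateway()
recovered = capture_with_recovery(gw, "pay_1", "key_1")
```

If the retried request simply replayed the successful result, the status call and the branch on it would not be needed.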

Just to clarify: I was speaking specifically about the case where you have a series of DB calls (like: auth user, retrieve account record, insert idempotency key, do more stuff), and the first one succeeds and the next ones fail. It can happen where the DB suddenly drops out as the request is executing, but it's more likely that it's either available or it isn't, so either the request succeeds, or it can't start.

And this comment was just meant to talk about faults with your own database. Once you are reaching out to other systems you see all kinds of problems regularly, but those tend to be handled more robustly because you kind of have to.

I might have misunderstood the scope of the idempotency layer. I thought it was a thin layer close to the web server that just replays the last answer from the services lower down. So does that mean Stripe actually goes down to some store or external system to check the transaction status for subsequent calls with the same key?

I know how complicated integrating with stuff like the bank, 3D Secure, AML and fraud prevention is. One of the systems I integrated with had no refund functionality; they expected a manually initiated bank transfer from our customer's bank to the end user! So I certainly understand that some states are irrecoverable within a web request. Automated recovery is still important to do where possible, though, because each recovery that is automated saves work for all clients.

Thank you for the insightful answer.

I see the problem with mutations in foreign systems if those foreign systems do not support idempotency themselves. IMHO, though, stripe should abstract away faults in banks, and figure out how to work around faults in bank's systems using e.g. automated refunds when a duplicate charge is detected, and not just bubble up a 500 to stripe's customers and leave it to them to figure out. If stripe cannot figure out in an automated way with the bank whether the request succeeded, stripe's API customers certainly can't either, and stripe should risk double-charging the end customer knowing the end customer will complain and request a chargeback.

> The most honest answer is that Stripe wasn't built on a data store where transactions are supported

Transactions are not necessary if one can do an insert conditional on the key not yet existing, but then it is required to have the idempotency key from the client enter into the primary key.

> IMHO, though, stripe should abstract away faults in banks, and figure out how to work around faults in bank's systems using e.g. automated refunds when a duplicate charge is detected, and not just bubble up a 500 to stripe's customers and leave it to them to figure out.

Yeah, this is what Stripe tries to do. Most problems during calls to foreign systems are handled in ways that try hard not to send back an internal error. 500s are tracked carefully because they're painful to the user, but also because they leave behind potentially bad state that'll eventually cause problems internally and externally.

If they can be reconciled, a webhook will be fired to give the caller a more determinate answer (obviously less convenient for them to handle, but at least some sort of message makes its way back). More documentation on that here:

https://stripe.com/docs/error-handling#server-errors

> Transactions are not necessary if one can do an insert conditional on the key not yet existing, but then it is required to have the idempotency key from the client enter into the primary key.

This is how the implementation works more or less — an insert on a unique index that will error on a duplicate so you know it happened already. You'd probably implement it similarly in basically any major database whether Mongo, Postgres, MySQL, etc.
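A minimal sketch of that insert-and-fail-on-duplicate pattern, using SQLite from Python's standard library purely for illustration. The table name, column name and `claim_key` helper are assumptions made up for this example and say nothing about Stripe's actual schema or data store.

```python
import sqlite3


def claim_key(conn: sqlite3.Connection, key: str) -> bool:
    """Return True if this call inserted the key, False if it was already there."""
    try:
        with conn:  # commits on success, rolls back if the insert raises
            conn.execute(
                "INSERT INTO idempotency_keys (idem_key) VALUES (?)", (key,)
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key (a unique index) rejected the duplicate,
        # so a request with this key has been seen before.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idempotency_keys (idem_key TEXT PRIMARY KEY)")

first = claim_key(conn, "req-123")   # True: first time this key is seen
second = claim_key(conn, "req-123")  # False: duplicate detected
```

The uniqueness check and the insert happen in one statement, so two concurrent requests with the same key cannot both succeed.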

This is only a very small part of what transactions get you though. If a transaction-based request makes it midway through its lifecycle and then fails, it can roll back to a fresh slate. In a transaction-less system, you need to come up with some other answer for what to do with the partial state that was left behind.
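The difference can be sketched with SQLite again. This is an illustration under assumptions, not a description of Stripe's system: the tables, the helper names and the `SimulatedFault` exception are all hypothetical, and the fault is injected by hand between the two writes.

```python
import sqlite3


class SimulatedFault(Exception):
    """Stands in for the database or process dropping out mid-request."""


def new_db() -> sqlite3.Connection:
    # isolation_level=None puts sqlite3 in autocommit mode: each statement
    # commits on its own unless BEGIN is issued explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE idempotency_keys (idem_key TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE charges (idem_key TEXT, amount INTEGER)")
    return conn


def write_steps(conn: sqlite3.Connection, key: str, amount: int) -> None:
    conn.execute("INSERT INTO idempotency_keys (idem_key) VALUES (?)", (key,))
    raise SimulatedFault("fault after the key insert, before the charge insert")


def with_transaction(conn: sqlite3.Connection, key: str, amount: int) -> None:
    conn.execute("BEGIN")
    try:
        write_steps(conn, key, amount)
        conn.execute("COMMIT")
    except SimulatedFault:
        conn.execute("ROLLBACK")  # the key insert is undone as well


def without_transaction(conn: sqlite3.Connection, key: str, amount: int) -> None:
    try:
        write_steps(conn, key, amount)
    except SimulatedFault:
        pass  # the key row is already committed; nothing undoes it


def count(conn: sqlite3.Connection, table: str) -> int:
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


tx_db, plain_db = new_db(), new_db()
with_transaction(tx_db, "req-1", 500)
without_transaction(plain_db, "req-1", 500)

keys_after_rollback = count(tx_db, "idempotency_keys")      # 0: fresh slate
keys_without_rollback = count(plain_db, "idempotency_keys")  # 1: orphaned key
```

In the second database the key exists with no matching charge, which is exactly the partial state that needs some other cleanup or reconciliation answer.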