Hacker News new | ask | show | jobs
by stickfigure 47 days ago
This is all way too much. If you see a duplicate idempotency key, skip the replay and always return 409. This becomes a client problem. Clients already need to help enforce idempotent contracts; "check for conflict response" is not an onerous imposition.

I've built multiple ecommerce APIs with this approach and they work great. No heroic measures required. You can often satisfy this contract with a unique constraint; if not, a simple presence check in redis. No hashing or worrying about PII.

My rant about this: https://github.com/stickfigure/blog/wiki/How-to-%28and-how-n...

8 comments

But that's not idempotent? If I'm a client and I don't know if the original request went through, getting a 409 on any subsequent requests tells me nothing about whether the original request was successful or not.
Retries will only receive 409 if the original request was successful. If the original request failed, the server performs the operation as normal on the second request. It doesn't replay failures.

The whole point of the idempotence mechanism is so you can make a reliable distributed system. If the first try fails, the client doesn't know if it succeeded or not, so the client should try again later ("at-least-once"). The idempotence mechanism just ensures that we don't get duplicates in the case that the first try actually succeeded.

If you replayed failures there wouldn't be any point to the idempotency key.

What if the original request is still being processed when the retry comes in? That doesn't fall into either of your categories: the request isn't successful, but it hasn't failed either.
That’s because everyone here thinking about payments completely incorrectly. They’re not atomic and your server shouldn’t pretend they are.

You need to store the payment state at each relevant step and process it asynchronously. If requests time out, you check the status of it using the key you store (with the processor) to see if it was even received.

It’s not perfect, some processors will 500 while processing the payment (Braintree), so you still need reconciliation on the backend.

I'm not as smart as you folks but doesn't durable execution being a part of the solution help a lot in this respect?
You still return 409, with error detail in the response body, because it is still true that the client sent multiple requests with the same idempotency key, regardless of the progress of the original request.

Regardless, I think your assumption about how the request/response cycle should be working is wrong. For this kind of API and transaction, the server should be returning a response immediately: 202 Accepted. The only thing the API server should be doing before returning is creating a row in a DB (with a "state" field with an initial value of "pending"), and pushing some work on a queue.

The server should not be sitting there with the HTTP request open, trying to complete the transaction, and only returning a response to the client when the transaction is finished or has encountered an error.

The client will have to learn about the progress of the state of the transaction outside of this initial request. There are many options here: polling, webhooks, a message queue like kinesis or kafka, etc.

Ah you see, you just essentially moved unique constraint to a client. It is no different generating an ID on client side and making a request. It is not idempotency.

Idempotency-Key should not replay the response (it depends, actually). But also it should not error 409. You need to be content aware before adding Idemmpotency Key header handling.

What will happen when the request is received and handled but during writing response body TCP connection dropped unexpectedly. And after second or two a connection reestablished. How two sides agree that previous request accepted and everything good to go? That's what Idempotency-Key header does.

By "request" I didn't necessarily mean the HTTP request sent by the client, and I don't think the post I was responding to did either. But I agree my use of terminology was ambiguous. Let me restate more clearly, and in a way that shows the issue under your request/response process:

An HTTP request comes in with a certain idempotency key. The server returns 202, as you say, and begins to process the database transaction.

While the server is still procesing the database transaction, a second HTTP request comes in with the same idempotency key. What response does this second HTTP request get? The original transaction that the first HTTP request triggered hasn't succeeded and hasn't failed, so it doesn't fall into either of the categories in the post I responded to.

Your answer is that the second HTTP request gets a 409, which makes sense to me, although others are objecting to it.

Best practice is to keep database transactions short. There is no value in returning a "hey this is in progress" error code to a client when transactions are short, and databases don't easily give you this primitive anyway.

You seem very focused on long-running orchestration type systems. You build these on top of basic transactional primitives, but it's a mistake to try to make the whole process a single transaction. You can have a quick, transactional "start process" operation which must be idempotent. Other operations like "check status" need not be so complicated.

You don't necessarily share the idempotency key between the "start process" request and the "check status" request. You could for convenience, but it isn't necessary, and on balance most APIs don't. This is the "client picks ID" vs "server picks ID" design choice.

> You still return 409

No no no no no.

You have multiple clients submitting the same business operation simultaneously. One must succeed, the others must fail. If you're using the 409 approach ("notify client that request is redundant") you must not send a 409 code until the work is complete.

The client must interpret 200 and 409 as success cases. 200 means "it was done" and 409 means "it was already done". Clients looping (say, processing durable queue messages) can stop when they receive these responses.

If the work is not complete, you can't return 409, or clients will think the work is done. You will lose messages.

What's the best practice here? It's trying to represent as binary a multi-state operation, but the redundant clients should check the response's body to know why it 409'd. If the process is slow, it can't return a 200 immediately, and yet it should return 409 to all other attempts, even if the initial attempt ends up unsuccessful.
> You have multiple clients submitting the same business operation simultaneously.

It doesn't have to be multiple clients. It could be the same client, not having received a response to its first request and deciding to re-send the request again.

Being charitable, I'd say the poster above is saying that in the web architecture you can (should?) shift more of the burden for idempotence to the client.

But, rather than 409, I'd say that you should be using opportunistic concurrency control if you adopt this perspective. There should be a resource context for the request, so the client can obtain an ETag and send If-None-Match headers, and get a 412 response if things are out of sync. That allows them to retry a failed/lost request and safely prevent a double action.

Under a 412, they have to step back and retry a larger loop where they GET some new state and prepare a new action. Just like in DB transaction programming, where your failed commit means you roll back, clean the slate, and start a whole new interrogation of transaction-protected state leading up to your new mutation request.

The client is already participating in a transaction in a distributed system. There is no way to change the reality of that. Suggestions about masking this only make the composite system unsound and will not improve net service reliability improvement.

That doesn't mean that idempotency keys have to be used. You can certainly hash message content if that is documented behavior. That probably only makes sense when there is already some logical session or transaction identifier that makes dedupe semantics clear.

The system you propose might be sound and might be necessary in some systems, but I can't think of what they might be that wouldn't be better served by the simpler solution that is already widely used for this purpose.

Your database should not allow both commits to happen - one should get rolled back.

If it processed 99% of the request and the final bookkeeping failed because of a duplicate, that's still a failed request.

Arguably this should be the primary way you check for idempotent requests - you shouldn't have a separate check for existence, you should have the insert/update fail atomically.

This is the same thing you see on filesystems for TOCTOU security holes - the right way is to atomically access and modify once, and you only know the request was already processed because that fails.

Not in the payments world. If you’re 99% done but only the bookkeeping failed, then it’s likely that money is changing hands and you need to deal with that fact. Payments are not an atomic infrastructure and you cannot magic that into reality.
Payments are multistep but each of the steps needs to be atomic. The "create payment" operation must be transactional and the communication channel between you and the processor must be idempotent so you don't inadvertently create multiple payments.

The fact that payments have a settlement process is not relevant to this discussion.

That's usually solved with traditional database transactions.

Even if you have a complex long-running multistep orchestration problem, you can break it down into simpler transactions. Eg you could start with a "lock the resources" txn.

But 99% of these conversations around idempotence are simple POST operations like "create order" that regular old database concurrency management handles just fine.

> That's usually solved with traditional database transactions.

That doesn't answer my question. What response do you return to the client in the case I described?

This is just normal concurrent programming? If two requests come in for the same idempotency key/customer reference id, only one will succeed. Use standard database transaction isolation.

So one will complete with 200, one will complete with 409. It doesn't matter which.

That said, there's something odd about the way you phrased this question. If the original request hasn't gotten a response yet, why is it sending a retry? What you're asking is more general: What happens when two conflicting requests come in? This is something we've been solving with RDBMSes since the 1970s.

The lock would normally make the second request wait, aka not return a response, until the first one is done. Then it sees it's a duplicate and returns that. Or it times out and returns an error. Then the client hopefully have some exponential back off strategy, so the third attempt doesn't suffer the same fate.
To be honest, I liked your original response about returning a 409 - it's not something I'd done before and I like how it keeps things simpler.

But your follow up responses here are making me rethink. Now you have to have all these special cases where the original request is still in process. I think or assertion of "99% are simple POST operations" is bullshit. For the times where idempotency is hard and really matters, often times you're calling a third party API, like a payment processing API.

I would think a better approach would be to always return a 409 on a subsequent request, regardless of whether it passed or failed, and then have a separate standard API that lets you get the result of any request by its idempotency key.

Idempotence already requires some client thinking if you want to conform to HTTP specs.

I.e. idempotent DELETE with proper protocol behavior requires that one request see the 200 OK or 204 No Content and the other sees 404 Not Found, because the delete has already happened. It would be misleading to say 200 OK to both, because that answer means the resource was there when the request arrived.

Honestly, the whole HTTP resource model has a different conceptual backing for state management than the independently developed "idempotence" concepts in distributed systems. Those non-HTTP concepts came from more message-based rather than resource-based architectural assumptions.

The cleanest mapping in the spirit of HTTP would be that you do multiple round trips. A POST creates a new idempotence context, a bit like "start a transaction". The new URI is the key for coordinating state change and allowing restart/recovery.

As I remember it, the idea of idempotence keys in headers really came from the SOAP RPC mindset. It's kind of funny to see it persisting in some hybrid SOAP + REST mental model.

> special cases where the original request is still in process

This isn't a special case, and it's the same problem if you want to replay the original response on conflict. If the original request isn't complete, what are you going to replay?

You have a transaction row lock and you tell the subsequent requests to f off. I would go as far as returning a 429.

Devs are too scared to be nice (ie not return errors) to clients when they misbehave.

You are improvising, and in the process, changing the semantics of a well established design pattern. Do not recommend.
Your comment contributes nothing to the discourse. It's FUD.

The pattern I describe was the dominant design pattern for financial transaction processing systems before Stripe. Stripe's API makes life for the clients slightly easier at the expense of making life for servers more complicated, but the two approaches are equivalent in function.

The topic is about idempotency though, and what you are describing is not idempotency. You are describing something different, and arguing “but it accomplishes a similar thing.” It appears you have come to the same conclusion as the article that building idempotency isn’t trivial.
It absolutely is idempotent behavior of the system. The goal is "make an idempotent payment", and this approach ("return an error for duplicates") was standard for financial APIs before Stripe. It still dominates for ecommerce APIs.

It isn't complicated, though I can see how if your entire experience with financial APIs is Stripe, you might not be aware of how simple it is. Because Stripe's approach, while mildly more convenient for clients, is a PITA to implement properly.

You provide in the response body a JSON blob or something that indicates a detailed error code, like DUPLICATE_IDEMPOTENCY_KEY, with some more information that points to the record ID of the corresponding transaction that the client can fetch.

Sure, a strict reading of "idempotence" might require that the response for subsequent requests be identical to the first, but for practical concerns, what matters is the API contract you define, document, and adhere to. The purpose of idempotence is to ensure that you don't end up with duplicate transactions. That's what actually matters. How that's represented in the protocol is an implementation detail.

If you're a client using the same idempotency key for a materially different request you have a bug.
Yes, and if you are building a payment API you need to be robust to client bugs.
Rejecting the conflicting request is being robust to improperly reused idempotency keys. There's no other reasonable answer.
The client is part of a distributed transaction. It can't be oblivious to this. Clear semantics and accurate adherence to them is the only answer that doesn't make the overall system unsound. Client bugs are expected and so the simplest semantics that ensure data integrity and accurate responses are the best way to help them identify and fix their bugs.
The 409 should come with an id or a url the client can use to find the original result.
...and if you're using the approach of "let the client pick ids", you don't even need that. The client has everything it needs.
I really like that article, thank you for sharing it again. I wish I had read it a decade ago, or even the first time you submitted it - I had to learn some of these things the hard way.

I agree with you - I don't think Stripe has made the right choices here and its unfortunate that it has inspired so many other people to make the same poor choices. I don't agree that their system is as sound as always returning 409s. I think having a short window where you return response bodies is fine, but after that they should still be sending 409s. If no one will ever actually resend a request after 24 hours how is it not fine to send 409s when they do? They've chosen to implement the more expensive choice and then not back it with the cheapest one.

I really like this approach, and am going to keep it in my back pocket.

However, it's good for e-commerce where there are a subset of "important" operations, but I'd argue the idempotency key is better for a financial API like Stripe where the majority of your operations need these semantics.

You also run into a problem if PUT/PATCH needs to be exactly-once instead of at least once. Not as common, but again, might be something you run into with a financial API.

There's nothing special about any particular HTTP method; it is the inherent nature of distributed systems that you can not have exactly-once semantics with a single HTTP request. To create exactly-once semantics, you need a client which retries until success and a server which prevents duplicates.

The only difference between the approach I'm describing and Stripe's approach is the detail of how the client knows that it's done. "My" mechanism (notify the client that a request is a duplicate) was dominant in financial transaction processing systems before Stripe; I used probably a half dozen of them back in the day.

Stripe came along and from the beginning decided "we want to compete based on making a friendly API". They succeeded, and if you ever look at Paypal's original API, it's easy to see why. It's true that repeating successful API responses makes life slightly easier for clients; they never need to check for a 409 (or whatever) response. It comes at a cost of making life quite a bit more complex on servers. Personally I don't think the tradeoff is worth it, but YMMV. If your single competitive advantage is "easy API" maybe it makes sense. If you're normal B2B, almost certainly not.

FWIW, I enjoyed your engagement throughout the thread above, fighting the good fight.

I think there is also a risk of an "easy API" when it leads to magical thinking and sloppy development. If the naive client programmer starts to think the reliability is handled for them, they may also flub the handling of the idempotence key that remains the crux here. E.g. not persisting them well enough and returning to a situation where the user can accidentally make duplicate payments just like the naive system with no idempotence feature...

Exactly - junior engineers haven't yet developed a good taste of "how much the system should try and deal with errors" - so they always go way overboard to guess and try and fix errors way too much. I agree with you - this is a client error - 409 and make them fix the bug on their side. Clean, reliable, simple, separation of responsibility.
And a good way to convey this contract is force the clients to hash-chain the calls, so is now more clear that it should do (and even can detect the send will fail at client side)
Thank you, like and agree with the rest of the rules
> If you see a duplicate idempotency key

I'm no expert but an "idempotency key" already has some major smell to it.

Then learn from the experts, maybe? https://docs.stripe.com/api/idempotent_requests
Why? I've used plenty of systems that have an idempotency key. E.g. many payment processors will take an order id and won't let you charge an order id more than once. That's just an idempotency key by another name.
This is because there are countless tutorials and slop on the internet that says, no, instructs, you to handle idempotency on the client instead of where it belongs. Server-side.