by bazizbaziz 3494 days ago
How do people in production handle the possibility that your service might miss a webhook notification? If you miss a notification you'll end up with stale data and you won't know it.

Slack has a retry policy for a while but will then just give up. Another webhook provider I've looked at says nothing at all about this sort of thing. How do folks deal with this in production systems?

Seems to me like the best way to address this issue is to use the webhook as a hint that you need to run some other process that guarantees you've got all updates.

10 comments

When I was at IFTTT (a few years ago, so it's definitely changed since then) we tried not to rely on the content of the webhooks and just used them as a hint as you describe to fetch new data. Not every API made this easy though.
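A minimal sketch of the "webhook as a hint" idea, assuming the provider exposes some way to ask for records newer than a known version. `fetch_since`, the `version` field, and the class name are hypothetical, not any particular provider's API:

```python
from typing import Callable, Dict, List


class HintedSync:
    """Treat each webhook as a 'something changed' hint and re-fetch from the API."""

    def __init__(self, fetch_since: Callable[[int], List[dict]]):
        # fetch_since(version) is a hypothetical API call that returns
        # every record with a version greater than `version`.
        self.fetch_since = fetch_since
        self.last_version = 0
        self.records: Dict[str, dict] = {}

    def on_webhook(self, _payload: dict) -> int:
        """Ignore the payload content; pull everything newer than what we hold."""
        updates = self.fetch_since(self.last_version)
        for rec in updates:
            self.records[rec["id"]] = rec
            self.last_version = max(self.last_version, rec["version"])
        return len(updates)
```

Because the fetch is keyed on the last version seen rather than on the payload, a missed webhook is healed by the next one that does arrive, and the same method can be called from a timer as a safety net.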

If receiving a webhook is critical, you should make your receiver do as little as possible to place the event into a resilient queueing system and then process them separately. That won't save you from bad DNS, TLS, etc. configs but it should help reduce the possibility that you DoS yourself with a flood of webhook events.
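Roughly what "do as little as possible" could look like, as a sketch only. The in-process `queue.Queue` is a stand-in for a resilient queueing system, and the function names and status codes are my own choices:

```python
import json
import queue

# Stand-in for a durable queue (an assumption; a real setup would use a
# persistent broker so events survive a crash of the receiver).
events: "queue.Queue[dict]" = queue.Queue()


def receive_webhook(raw_body: bytes) -> int:
    """Do the minimum: check that the body parses, enqueue it, acknowledge."""
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400
    events.put(payload)
    return 202  # accepted; the real work happens later


def drain(handler) -> int:
    """Process queued events outside the HTTP request path."""
    processed = 0
    while True:
        try:
            payload = events.get_nowait()
        except queue.Empty:
            return processed
        handler(payload)
        processed += 1
```

The receiver never runs business logic, so a slow or failing handler cannot turn into timeouts on the provider's side.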

Also (shameless plug), you could monitor and log them (we offer retries if your server fails): https://www.runscope.com/product/alerts

I would prefer to implement the sending of webhooks in bulk - if the consumer falls behind, they receive up to 100-1000 webhooks per request (depending on the size and complexity of each individual webhook - ids only is 1000, complex documents 100). This drastically cuts down on the number of concurrent requests to a single client when load is high, or the consumer broke down for a period of time.

Unfortunately, developers writing code to receive batch requests are often... inadequate, to say the least. They'll write basic looping code without any error/exception handling; so if the 3rd item in a bulk request of 100 items causes a server-side error for them, they throw a 500 Internal Server Error or similar and fail to continue processing items 4 through 100. You simply cannot batch webhooks as a producer, unless you detect a single failure from the client to process a batch as a cue to drop to performing "batches" of size 1 until you receive an error for a single request, at which point you return to bulk. Rinse and repeat.
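The bulk / size-1 fallback described above can be sketched like this. `send` is a hypothetical callable that POSTs a batch and returns True on a 2xx response; the batch size and the "park it for later" handling are assumptions:

```python
from typing import Callable, List, Sequence


def deliver(events: Sequence[dict],
            send: Callable[[List[dict]], bool],
            batch_size: int = 100) -> List[dict]:
    """Send events in batches; after a failed batch, drop to size-1 requests
    until one of them fails, then go back to bulk.
    Returns the events that failed even when sent alone."""
    failed: List[dict] = []
    i = 0
    bulk = True
    while i < len(events):
        if bulk:
            chunk = list(events[i:i + batch_size])
            if send(chunk):
                i += len(chunk)
            else:
                bulk = False  # something in this chunk breaks the consumer
        else:
            event = events[i]
            i += 1
            if not send([event]):
                failed.append(event)  # park it for a later retry
                bulk = True           # culprit found, resume batching
    return failed
```

With ten events, a batch size of 5, and a consumer that chokes on event 3, the request sizes come out as 5, 1, 1, 1, 5, 2, and only event 3 ends up in the failed list.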

Honestly, being the producer sending webhooks to consumers which are written by random developers is a nightmare. You have to understand that your customers will not write proper code to accept your webhook requests, even if each request is for a single webhook. You also must understand that your customers will not look to blame themselves for shitty code. You can retry 1,000 times over a 48 hour period, and if their code still fails to process the webhook, it will be YOUR fault, not theirs. Truthfully, it is horrible to be on the sending end of webhooks to random developers/customers.

Transactions are obviously too enterprisey for fast moving unicorns; better spend 3 weeks to badly hack together a ridiculous farce.
I don't understand, if it's such a nightmare why don't you (the producer) create the code/libraries to use those webhooks? At least in the 2 most common platforms (e.g. PHP, Java)
Stripe has a retry policy as well.

You can set up something where it will alert you if there are too many failures in a certain time period. That isn't offered by Stripe but you can build it.
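One way to build that kind of alert yourself is a sliding-window counter. This is a sketch under my own assumptions (class name, threshold, and window are all made up), not something Stripe provides:

```python
import time
from collections import deque
from typing import Deque, Optional


class FailureAlarm:
    """Fire when `threshold` failures land within `window_seconds`."""

    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold
        self.window = window_seconds
        self.failures: Deque[float] = deque()

    def record_failure(self, now: Optional[float] = None) -> bool:
        """Record one failed webhook; return True when the alert should fire."""
        now = time.time() if now is None else now
        self.failures.append(now)
        # Forget failures that have fallen out of the window.
        while self.failures and self.failures[0] <= now - self.window:
            self.failures.popleft()
        return len(self.failures) >= self.threshold
```

Call `record_failure()` wherever your handler catches an error, and page someone when it returns True.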

If you mean in the case of "catastrophic failure", there is none.

If there is a "catastrophic failure" (machine gets shut off for a week, data center blown up, whatever), there are probably bigger issues or we probably would already know.

Stripe has an "events" API that can be polled to receive the same content that you would have received via Webhook [1].

(Disclaimer: I work there.)

If you missed some Webhooks due to an application failure, it's possible to page through it and look for omissions. I've spoken to at least one person integrating who had this sort of setup running as a regular process to protect against the possibility of dropped Webhooks. This usually works pretty well, but does start to break down at very large scale where events are being created faster than you can page back.
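A sketch of such a reconciliation job, assuming a newest-first event list that is paged by "give me what comes after this id". `list_page` is a hypothetical wrapper around the provider's client, not the real Stripe call signature, and `seen_ids` is whatever store you keep of event ids already processed:

```python
from typing import Callable, List, Optional, Set


def find_missed_events(list_page: Callable[[Optional[str]], List[dict]],
                       seen_ids: Set[str],
                       max_pages: int = 50) -> List[dict]:
    """Page backwards through a newest-first event list and collect events
    that never arrived by webhook. `list_page(after_id)` returns the page
    that follows `after_id` (None means start at the newest event)."""
    missed: List[dict] = []
    after: Optional[str] = None
    for _ in range(max_pages):  # bound the walk so it always terminates
        page = list_page(after)
        if not page:
            break
        missed.extend(e for e in page if e["id"] not in seen_ids)
        after = page[-1]["id"]
    return missed
```

The `max_pages` bound is where the scale problem shows up: if events are created faster than the job can page back, the walk hits the bound before it reaches events it has already seen.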

The possibility of dropped events is a major disadvantage of Webhooks in my mind -- if you consider other alternatives for streaming APIs like a Kafka/Kinesis-like stream (over HTTP) that's simply iterated through periodically with a cursor, you avoid this sort of degenerate case completely, and also get nice things like a vastly reduced number of total HTTP requests, and guaranteed event ordering.
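To make the cursor idea concrete, here is a toy in-memory version. `EventStream` and `poll_once` are invented for illustration; a real stream would sit behind HTTP and the cursor would be stored durably on the consumer's side:

```python
from typing import Any, Callable, List, Tuple


class EventStream:
    """In-memory stand-in for an append-only event stream (hypothetical)."""

    def __init__(self) -> None:
        self._events: List[Any] = []

    def append(self, event: Any) -> None:
        self._events.append(event)

    def read(self, cursor: int, limit: int = 100) -> Tuple[List[Any], int]:
        """Return up to `limit` events starting at `cursor`, plus the next cursor."""
        batch = self._events[cursor:cursor + limit]
        return batch, cursor + len(batch)


def poll_once(stream: EventStream, cursor: int,
              handle: Callable[[Any], None]) -> int:
    """Process everything from `cursor` onward, in order, and return the new
    cursor. The caller should persist it only after handling has succeeded."""
    while True:
        batch, next_cursor = stream.read(cursor)
        if not batch:
            return cursor
        for event in batch:
            handle(event)
        cursor = next_cursor
```

If the consumer is down for a while, nothing is dropped: the next `poll_once` simply picks up from the stored cursor and sees every event in order.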

(But to be clear, Webhooks are overall pretty good.)

[1] https://stripe.com/docs/api#events

Oh gosh, that is super neat! :)

I never even thought of using it that way. I just use events to check that it is a valid Stripe event (Probably easier / better to set up the ELB to only listen to certain addresses)

Some further related reading; Fowler talks polling for events in some of his Enterprise Integration stuff http://www.martinfowler.com/articles/enterpriseREST.html

EDIT: Not Fowler, but his site lol.

We[1] had a similar problem with clients reporting to us about lost callbacks[2] (our term for webhook). To solve it, we have built two options.

- Get a notification email every time the callback fails. The email contains the same information the callback was supposed to deliver.

- Retries. We retry every 5 mins for up to 24 hrs, or until the callback call succeeds. We created a sub-resource called `calls` (/callbacks/[id]/calls) that keeps the status of each call we made. If it succeeds, the status changes to "SUCCESS"; if it fails, it remains "FAILED". If the receiver system is still down after 24 hrs and the call has not succeeded, the developer can make a call to GET /callbacks/[id]/calls?status=FAILURE and receive all the failed calls. They can process the content and do a PUT /callbacks/[id]/calls?id=ID1&id=ID2&id=ID3... with body `{ "status": "SUCCESS" }` to mark them as "SUCCESS".

The calls are saved for up to 7 days, so that the dev has enough time to fix their server issues and get back all the lost callback calls. This solved most of the client issues.
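For anyone who wants to copy the retry policy (every 5 mins, for at most 24 hrs), a rough sketch follows. This is not Whispir's code; `send` and `sleep` are hypothetical callables so the loop can be exercised without actually waiting:

```python
from typing import Callable, Tuple

RETRY_INTERVAL_S = 5 * 60        # from the comment: retry every 5 minutes
RETRY_WINDOW_S = 24 * 60 * 60    # from the comment: give up after 24 hours


def deliver_with_retries(send: Callable[[], bool],
                         sleep: Callable[[float], None]) -> Tuple[str, int]:
    """Call `send` until it succeeds or the retry window is used up.
    Returns the final status and the number of attempts made."""
    attempts = 0
    elapsed = 0
    while True:
        attempts += 1
        if send():
            return "SUCCESS", attempts
        if elapsed + RETRY_INTERVAL_S > RETRY_WINDOW_S:
            return "FAILED", attempts  # left for the receiver to fetch later
        sleep(RETRY_INTERVAL_S)
        elapsed += RETRY_INTERVAL_S
```

In production you would pass `time.sleep`, or more likely schedule each attempt as a delayed job instead of blocking a worker for a day.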

* An added benefit of this came to the devs who could not get an inbound POST from us into their network due to firewall restrictions. The firewall restriction defeated the purpose of live callbacks, but with the `status` option, they only check for new (`FAILED`) notifications once every 2 hrs or so, and mark the ones they have processed as `SUCCESS`. This way, they only look for `FAILED` calls and process them when there are any. Otherwise, there is nothing to do.

[1] Whispir - https://www.whispir.com/ [2] https://whispir.github.io/api/#handling-callback-failures

I have recently moved all received webhooks to a job queue and have been very happy. You can retry the processing on your own terms.
This.

Previous devs were doing expensive things whenever we received webhooks. This meant we DoS'd ourselves every time a sizable amount of webhooks came our way.

Set up a tiny server on Heroku that received the webhooks and put them on a queue. A worker with a configurable concurrency level later forwards the events on the queue.

Dropped from four digit 502s and 504s weekly to virtually none.
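The worker side with a concurrency cap might look something like this. It is a sketch using threads and an in-process queue as stand-ins; the setup described above used a separate queue service, and `run_workers` is a made-up name:

```python
import queue
import threading
from typing import Callable


def run_workers(jobs: "queue.Queue[dict]",
                handle: Callable[[dict], None],
                concurrency: int) -> None:
    """Drain `jobs` with at most `concurrency` handlers running at once."""

    def worker() -> None:
        while True:
            try:
                job = jobs.get_nowait()
            except queue.Empty:
                return  # nothing left for this worker
            try:
                handle(job)
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The cap is the whole point: a burst of incoming webhooks piles up on the queue instead of fanning out into as many simultaneous expensive operations as there are requests.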

Agreed.

It also allows you to do testing by injecting pre-cooked payloads into your queue system.

Maybe webhook providers could provide an endpoint where one could poll for events that failed to deliver.
The good APIs do, but it's still at a loss to both sides.

a) The producer of the events has to store them in semi-permanent storage. I've been there and done that - failed webhooks result in a table of tens of millions of rows, even if the memory on each event is only 48 hours. It's astounding how many events fail to process. And I've been through extensive verification that there is truly no problem on our side - it's always the client who is wrong. Emails back and forth for weeks with the client screaming "it's your fault!" - only to finally receive an "oops, we found the problem on our end... sorry".

b) Frankly, if the consumer of the events fails on a single webhook more than 5 times in a 24 hour period, that event is a permanent loss. The reason it fails consistently is because that specific event is a permanent failure to process on the consumer's side. It is probably throwing a 500 Internal Server Error or similar - every single time. 0.001% of webhook consumers actually have emergency alerts when webhooks fail on their end, so the job will continue to throw a silent/unlogged/unnoticed/ignored error no matter how many times you retry. These are the same type of developers who will never poll your "failure queue", because they don't even understand that their consumer endpoint throws 500 Internal Server Errors on 10% of your requests. You're trying to provide a service to developers that live in a fantasy world where errors and exceptions never happen on their end.

It's a simple fact that developers who consume webhook requests are a disgrace. Chances are that if a request fails two times, it will never succeed. And yet the best APIs will try hundreds/thousands of times over a 24 hour period - simply to prove to that client that it is their fault that they are not processing webhooks properly. There is only so much a webhook producer can do. There is no magic we can do if the consumer is copy/pasting PHP snippets from Google or Stackoverflow.

Story time. The most memorable situation I can remember is a client who was experiencing 100% webhook consumer failure for more than three weeks. The emails from their team - and subsequent phone calls from their CTO - were absolutely stunning; it got to the point that we were hounding our own business people to drop them as a client, the verbal abuse was that bad. Turns out they had a bunch of PHP developers who were for the first time writing their consumer webhook endpoint in C for some reason. They were trying to parse the custom "id" field that they sent us as a string in a JSON field, as an integer. It was all because they sent us a string, and choked on trying to re-interpret it as an integer. It hurts to even think about that case.

tldr; Fuck webhook consumers. Incompetent developers who don't know how to handle errors that are 100% their fault.

Funny aside: the most amusing cases come from PHP and .NET developers who expose their internal server errors in production. When you can copy/paste the response they gave you on a webhook because they are calling an undefined function or method... pure bliss.

You could also help customers who apparently have trouble properly connecting to your APIs by giving better error returns (got type A, expected type B), providing client libraries or giving more extensive support (for a price). Blaming the customer is easy, providing a way for even those "incompetent developers" to interface with you in a way that is easy to understand and debug for all parties is hard.
The truly great developers find a better way than only retrying webhooks and prepare a client library that the customer can just plug in to their code :-)
I like what Shopify does here - because your app is tied to a partner account, they can email you saying "this payload has failed 20 times in succession". If it fails too many times then the webhook is uninstalled.
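A producer-side sketch of that policy: count consecutive failures, email at 20, and uninstall past some higher limit. Only the 20 comes from the comment above; the removal threshold and all names are made up, and this is not how Shopify actually implements it:

```python
class WebhookSubscription:
    """Track a consumer's failure streak and decide what to do about it."""

    def __init__(self, warn_after: int = 20, remove_after: int = 40):
        # warn_after=20 mirrors the comment; remove_after is an assumed value.
        self.warn_after = warn_after
        self.remove_after = remove_after
        self.consecutive_failures = 0
        self.active = True

    def record_delivery(self, ok: bool) -> str:
        """Update the streak and return the action to take:
        'ok', 'retry', 'warn' (email the owner) or 'remove' (uninstall)."""
        if ok:
            self.consecutive_failures = 0
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.remove_after:
            self.active = False
            return "remove"
        if self.consecutive_failures == self.warn_after:
            return "warn"
        return "retry"
```

Any success resets the streak, so a consumer with occasional blips never gets warned, while one that fails every single time is eventually cut off.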

Not to be snarky - but it's a distributed system. There's no way to guarantee you've got all updates! At a certain combination of latency and volume polling becomes impossible so webhooks (or something analogous) are all you've got :)

At a certain combination of latency and volume polling becomes impossible so webhooks (or something analogous) are all you've got :)

Isn't it the opposite? At a certain volume, when each polling request always returns results, polling becomes more efficient than "interrupts". It's only at low volumes that webhooks are more efficient, since polling would have to issue a lot of requests with no response if a low latency is required.

Assuming here you mean something like a classic REST-alike "/events" endpoint which returns a bunch of stuff that's changed since the last time you requested it.

In that case, as the number of events grows, the HTTP transaction overhead goes to zero with polling, yeah.

But now you have a bunch of extra things which will impact your latency:

- The third-party service will do more work preparing the payload, meaning that the earliest event on the list no longer hits the wire right away

- related: someone might be holding a lock on event 63 of 100. Now other events have to wait for it before they can hit the wire

- In your application code, you may have to read the entire request before you can validate it or do anything with it (at least, this goes for APIs which speak JSON)

- You probably have to commit your transaction for the previous page of events before you can start your next request. Otherwise, whichever side of the network is keeping tabs on your current pointer in the list, that pointer may end up in the wrong place. Oops!

- If more events happen during the time it takes you to request a page than will fit on a page, then you're really stuck.

- An error anywhere in the super-http-transaction (network, user code...) now means that an entire page of updates has been delayed rather than just one.

It's possible to remove the sequential-ness constraint from our hypothetical "/events" but not without introducing other fun new problems.

By periodic reconciliation of the full dataset.
Yeah, I feel the best way is just for providers to give an RSS feed as the primary way of listing events and then notify with PubSubHubbub directly. Big advantage: everything already exists and is standard.
The easiest implementation would be a serial number. Then the client can check for holes in the number series.
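Checking for holes is only a few lines on the client, assuming the provider stamps each event with a consecutive integer (the function name is hypothetical):

```python
from typing import Iterable, List


def find_gaps(serials: Iterable[int]) -> List[int]:
    """Return the serial numbers missing between the lowest and highest seen."""
    present = set(serials)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]
```

For example, `find_gaps([1, 2, 4, 7, 5])` returns `[3, 6]`. Note that this only finds holes between events you did receive: if the most recent events are the ones that went missing, there is nothing after them to reveal the gap, so you still need an occasional poll for the latest serial.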