| Implementing webhook deliveries is one of those things that's way harder than you would initially imagine. The two things I always look for in systems like this are: 1. Can it handle deliveries to slow on unreliable endpoints? In particular, what happens if the external server is deliberately slow to respond - someone could be trying to crash your system by forcing it to deliver to slow-loading endpoints. 2. How are retries handled? If the server returns a 500, a good webhooks system will queue things up for re-delivery with an exponential backoff and try a few more times before giving up completely. Point 1. only matters if you are delivering webhooks to untrusted endpoints - systems like GitHub where anyone can sign up for hook deliveries. 2. is more important. https://github.com/xataio/pgstream/blob/bab0a8e665d37441351c... shows that the HTTP client can be configured with a timeout (which defaults to 10s https://github.com/xataio/pgstream/blob/bab0a8e665d37441351c... ) From looking at https://github.com/xataio/pgstream/blob/bab0a8e665d37441351c... it doesn't look like this system handles retries. Retrying would definitely be a useful addition. PostgreSQL is a great persistence store for recording failures and retry attempts, so that feature would be a good fit for this system. |
Even with a retry scheme, unless you are willing to retry indefinitely, there is always the problem of missed events to handle. If you need to handle this anyway, it might as well use this assumption to simplify the process and not attempt retries at all.
From a reliability perspective, it is hard to monitor whether the delivery is working if there is a variable event rate. Instead, designing it to have a synchronous endpoint the web hook server can call periodically to “catch up” has better reliability properties. Incidentally, this scheme handles at-most-once semantics pretty well, because the server can periodically catch up on all missed events.