Ask HN: How do you monitor and retry failed webhooks in production?
4 points by GoatPerfect 117 days ago
I’ve been working on a project where webhooks are a core part of the system, and I realized how fragile they can be in practice.

Transient network errors, timeouts, downstream issues — things fail more often than expected.

I’m curious how others are handling this in production.

Are you building custom retry logic?

Using a queue?

Relying on provider retries?

Just logging and manually checking failures?

Do you monitor webhook delivery rates or alert on repeated failures?

Would love to hear what setups people are using and what’s worked (or not worked) for you.

3 comments

We treat webhooks as at-least-once delivery over an unreliable transport and design for duplicates and out-of-order events.

A few rules that have saved us:

- Persist before responding. Never process inline. Write payload to DB, return 200 fast.

- Idempotency key required. Use either the provider's event ID or a hash of the payload.

- Async worker processes from queue. Exponential backoff + max attempts.

- Dead letter queue + dashboard. Humans need visibility.

- Alert on backlog growth, not single failures. One failure is noise. A growing retry queue is signal.

- Relying on provider retries alone has bitten us more than once.
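
For anyone who wants something concrete, here is a rough sketch of the rules above, not our actual code. It assumes an in-memory SQLite table standing in for the DB, and every name in it (`inbox`, `receive`, `backoff_delay`, `backlog_growing`) is made up for illustration. The base delay, cap, and sample count are arbitrary placeholders you would tune.

```python
import hashlib
import sqlite3
from typing import Optional, Sequence


def idempotency_key(event_id: Optional[str], payload: bytes) -> str:
    """Prefer the provider's event ID; otherwise hash the raw payload."""
    if event_id:
        return f"id:{event_id}"
    return "sha256:" + hashlib.sha256(payload).hexdigest()


def open_inbox() -> sqlite3.Connection:
    """Create a throwaway in-memory inbox table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inbox ("
        " idem_key TEXT PRIMARY KEY,"
        " payload BLOB NOT NULL,"
        " attempts INTEGER NOT NULL DEFAULT 0,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def receive(conn: sqlite3.Connection, event_id: Optional[str], payload: bytes) -> bool:
    """Persist the payload before acknowledging it.

    Returns True if the event was stored, False if it was a duplicate.
    Either way the HTTP handler can answer 200 right after this returns;
    the real processing happens later in a worker.
    """
    key = idempotency_key(event_id, payload)
    cur = conn.execute(
        "INSERT OR IGNORE INTO inbox (idem_key, payload) VALUES (?, ?)",
        (key, payload),
    )
    conn.commit()
    return cur.rowcount == 1


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based), capped.

    No jitter here; adding some randomness is a common refinement.
    """
    return min(cap, base * 2 ** (attempt - 1))


def backlog_growing(depths: Sequence[int], min_samples: int = 3) -> bool:
    """True when the last `min_samples` queue-depth readings strictly increase."""
    recent = list(depths[-min_samples:])
    return len(recent) == min_samples and all(
        a < b for a, b in zip(recent, recent[1:])
    )
```

The primary key on the idempotency key is what makes duplicates harmless: a redelivered event is silently ignored at insert time. `backlog_growing` is the kind of check you would run on periodic queue-depth samples so that you alert on a trend rather than on each individual failure.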

Thank you so much for the tips! I was feeling nervous about relying on provider retries as well. I especially like the idea of alerting on backlog growth. There's nothing I hate more than a bunch of emails and notifications!
This was a nice goat exchange
We receive the webhook, return 200 immediately, and push the payload to a message queue for processing. That way you own the retry logic, can inspect stuck messages, and DLQ alerts handle repeated failures automatically.

Idempotency becomes your responsibility, though, since messages can be delivered more than once.
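
A minimal sketch of what that looks like on the consumer side, assuming a plain in-process `deque` in place of a real message queue. `Message`, `drain`, and the `max_attempts` default are hypothetical names and values chosen for the example, and a real worker would wait between retries instead of requeueing immediately.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Set


@dataclass
class Message:
    key: str  # idempotency key, e.g. the provider's event ID
    body: str
    attempts: int = 0


def drain(
    queue: Deque[Message],
    handler: Callable[[Message], None],
    seen: Set[str],
    dead_letters: List[Message],
    max_attempts: int = 3,
) -> None:
    """Process every queued message, tolerating duplicates and failures."""
    while queue:
        msg = queue.popleft()
        if msg.key in seen:
            continue  # duplicate delivery of an already-processed event
        try:
            handler(msg)
        except Exception:
            msg.attempts += 1
            if msg.attempts >= max_attempts:
                dead_letters.append(msg)  # park it for a human to inspect
            else:
                queue.append(msg)  # retry later (no delay in this sketch)
            continue
        # Mark as processed only after the handler succeeded.
        seen.add(msg.key)
```

In production the `seen` set would need to live somewhere durable (a DB table with a unique constraint, for example), otherwise a worker restart forgets what it has already handled.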

Have you checked out https://svix.com? No affiliation, I just like the product. Might also check out https://www.standardwebhooks.com/
I just checked them out! Looks like it would make handling failures a breeze!