by code-e 1161 days ago
As the maintainer of a RabbitMQ client library (not the golang one mentioned in the article), the bit about dealing with reconnections really rang true. Something about the AMQP protocol seems to make library authors just... avoid dealing with it, forcing the work onto users or wrapper libraries. It's a real frustration across languages: golang, python, JS, etc. Retry/reconnect is built into HTTP libraries and database drivers. Why don't more authors consider this a core component of a RabbitMQ client?
4 comments

AMQP (the protocol) basically requires this kind of behavior. Whenever anything "bad" happens, the state of the session becomes unknown (was the last message actually sent? was the last ack actually received? etc), and the only way to correct it is to kill the connection, create a new session, and start over.

Fixing that really needs to be at the protocol level, like a way to re-establish a previous session, or rollback the session state, or something. It's definitely hard mode for library authors to fix this in any kind of transparent way.

The qpid client libraries supported automatic transparent reconnection attempts, but in the end I usually had to disable them in order to add logic for what to do after reconnecting. I.e., I needed to know the connection was lost in order to handle it anyway.
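
A minimal sketch of that pattern in Python, not tied to any real AMQP library. The names `connect`, `work`, and `on_session` are hypothetical callables supplied by the application. On any connection error the session is thrown away rather than repaired, and the application is told about every fresh session so it can decide what to redo:

```python
import time
from typing import Any, Callable


def _close_quietly(conn: Any) -> None:
    """Best-effort close; the session state is unknown, so errors are ignored."""
    if conn is None:
        return
    try:
        conn.close()
    except Exception:
        pass


def run_with_reconnect(
    connect: Callable[[], Any],              # hypothetical: opens a brand-new connection
    work: Callable[[Any], None],             # hypothetical: publishes/consumes until done
    on_session: Callable[[Any, int], None],  # hypothetical: notified of each new session
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `work` on a fresh connection, starting over after each failure.

    Returns the attempt number that succeeded; re-raises the last
    ConnectionError once `max_attempts` is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        conn = None
        try:
            conn = connect()
            # The application learns that a new session exists and can
            # redeclare, resubscribe, or resend whatever it needs.
            on_session(conn, attempt)
            work(conn)
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise
        finally:
            # Never reuse a session after an error: discard it.
            _close_quietly(conn)
        # Exponential backoff before the next fresh connection.
        sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The hook is the point: the library can own the retry loop, but the application still finds out that the old session is gone.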

NATS has all of these features built in, and is a small standalone binary with optional persistence. I still don’t understand why it’s not more popular.
This also occurs when dealing with channel-level protocol exceptions, so this behavior is doubly important to get right. I think one of the hard parts here is that the client needs to be aware of these events in order to ensure that application level consistency requirements are being kept. The other part is that most of the client libraries I have seen are very imperative. It's much easier to handle retries at the library level when the client has specified what structures need to be declared again during a reconnect/channel-recreation.
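
To illustrate the declarative idea, here is a rough sketch, assuming a hypothetical channel object with `declare_exchange`, `declare_queue`, and `bind_queue` methods (these are made-up names, not the API of Pika, py-amqp, or any other client). The application describes its topology once as data, and the same description is replayed on every new channel:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Binding:
    queue: str
    exchange: str
    routing_key: str


@dataclass
class Topology:
    """What must exist on the broker, described as data instead of calls."""

    exchanges: Dict[str, str] = field(default_factory=dict)  # name -> exchange type
    queues: List[str] = field(default_factory=list)
    bindings: List[Binding] = field(default_factory=list)

    def apply(self, channel: Any) -> None:
        """Declare everything on `channel` (hypothetical interface).

        Order matters: exchanges and queues must exist before they are bound.
        """
        for name, kind in self.exchanges.items():
            channel.declare_exchange(name, kind)
        for name in self.queues:
            channel.declare_queue(name)
        for b in self.bindings:
            channel.bind_queue(b.queue, b.exchange, b.routing_key)


# Example description; a reconnect or channel-recreation handler would call
# topology.apply(new_channel) each time a fresh channel is opened.
topology = Topology(
    exchanges={"orders": "topic"},
    queues=["orders.created"],
    bindings=[Binding("orders.created", "orders", "order.created")],
)
```

Because the library holds the description rather than a history of imperative calls, it knows exactly what to declare again after a reconnect.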
Lol, your comment resonates deeply with me.

We've had RabbitMQ as part of our stack at my day job since time began, I think it's great software overall but boy are the client libraries a challenge.

We've built a generalised abstraction around first Pika and then pyamqp (because Pika had some odd issues, I forget the details) and while pyamqp seems better, it's still not without its odd warts.

We ended up needing to develop a watchdog to wrap invocation of amqp.Connection.drain_events(timeout: int) because despite using the timeout, that call would very occasionally inexplicably block forever (with the only way to break it free being to call amqp.Connection.collect()).
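
A watchdog of that general shape could look something like the sketch below. It is not our actual code, and it is deliberately library-agnostic: `blocking_call` and `break_free` are hypothetical callables. In the setup described above they would presumably wrap the `drain_events` call and `collect` respectively, but whether it is safe to call the latter from another thread is an assumption here.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_watchdog(
    blocking_call: Callable[[], T],   # hypothetical: the call that may hang
    break_free: Callable[[], None],   # hypothetical: forces the call to return or raise
    deadline: float,                  # seconds before the watchdog intervenes
) -> T:
    """Run `blocking_call`; if it outlives `deadline`, invoke `break_free`.

    Raises TimeoutError when the watchdog had to intervene. Known limitation
    of this sketch: if the timer fires just as the call returns on its own,
    `break_free` still runs and the call is reported as timed out.
    """
    fired = threading.Event()

    def _on_deadline() -> None:
        fired.set()   # record first, so the caller can tell why it was unblocked
        break_free()

    timer = threading.Timer(deadline, _on_deadline)
    timer.daemon = True
    timer.start()
    try:
        result = blocking_call()
    except Exception as exc:
        if fired.is_set():
            raise TimeoutError("watchdog fired") from exc
        raise
    finally:
        timer.cancel()
    if fired.is_set():
        raise TimeoutError("watchdog fired")
    return result
```

The deadline should be comfortably longer than the timeout passed to the wrapped call, so the watchdog only intervenes when that timeout has genuinely failed.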

My other data point was a time I built something to slice off a copy of production data for testing purposes (from instances of the system above) using Benthos (pretty cool software tbh, Go underneath), but it would inexplicably just stop consuming messages and I had no idea why (so I just went back to our gross but proven Python abstraction to achieve the same).